MATH50003 Numerical Analysis Lecture Notes

Original Author: Dr. Sheehan Olver | Compiled by: Isaac Lee | Date: 02/03/2022 | Github Source

Contents:

Week 0: Asymptotics and Computational Cost (1. Asymptotics as n → ∞; 2. Asymptotics as x → x₀; 3. Computational cost)
Week 1: Numbers (1. Binary representation; 2. Integers; 3. Floating-point numbers; 4. Arithmetic; 5. High-precision floating-point numbers)
Week 2: Differentiation (1. Finite-differences; 2. Dual numbers)
Week 3: Structured Matrices (1. Dense vectors and matrices; 2. Triangular matrices; 3. Banded matrices; 4. Permutation matrices; 5. Orthogonal matrices)
Week 4: Decompositions and least squares (1. QR and least squares; 2. Reduced QR and Gram–Schmidt; 3. Householder reflections and QR; 4. PLU decomposition; 5. Cholesky decomposition; 6. Timings)
Week 5: Singular values and conditioning (1. Vector norms; 2. Matrix norms; 3. Singular value decomposition; 4. Condition numbers)
Week 6: Differential Equations via Finite Differences (1. Time-evolution problems; 2. Two-point boundary value problems; 3. Convergence)
Week 7: Fourier series (1. Basics of Fourier series; 2. Trapezium rule and discrete Fourier coefficients; 3. Discrete Fourier transform and interpolation; 4. Fast Fourier Transform)

Week 0: Asymptotics and Computational Cost

YouTube Lectures: Asymptotics and Computational Cost

We introduce Big-O, little-o and asymptotic notation and see how they can be used to describe computational cost.

  1. Asymptotics as n → ∞
  2. Asymptotics as x → x₀
  3. Computational cost

1. Asymptotics as n → ∞

Big-O, little-o, and "asymptotic to" are used to describe behaviour of functions at infinity.

Definition (Big-O)

f(n) = O(ϕ(n))   (as n → ∞)

means

|f(n)/ϕ(n)|

is bounded for sufficiently large n. That is, there exist constants C and N₀ such that, for all n ≥ N₀, |f(n)/ϕ(n)| ≤ C.

Definition (little-O)

f(n) = o(ϕ(n))   (as n → ∞)

means

lim_{n → ∞} f(n)/ϕ(n) = 0.

Definition (asymptotic to)

f(n) ∼ ϕ(n)   (as n → ∞)

means

lim_{n → ∞} f(n)/ϕ(n) = 1.

Examples

cos n / (n² − 1) = O(n⁻²)

as

|cos n / (n² − 1)| / n⁻² = n² |cos n| / (n² − 1) ≤ n² / (n² − 1) ≤ 2

for n ≥ N₀ = 2.

log n = o(n)

as

lim_{n → ∞} (log n)/n = 0.

n² + 1 ∼ n²

as

(n² + 1)/n² → 1.

Note we sometimes write f(O(ϕ(n))) for a function of the form f(g(n)) such that g(n)=O(ϕ(n)).

Rules

We have some simple algebraic rules:

Proposition (Big-O rules)

O(ϕ(n)) O(ψ(n)) = O(ϕ(n)ψ(n))   (as n → ∞)
O(ϕ(n)) + O(ψ(n)) = O(|ϕ(n)| + |ψ(n)|)   (as n → ∞).

2. Asymptotics as x → x₀

We also have Big-O, little-o and "asymptotic to" at a point:

Definition (Big-O)

f(x) = O(ϕ(x))   (as x → x₀)

means

|f(x)/ϕ(x)|

is bounded in a neighbourhood of x₀. That is, there exist constants C and r such that, for all 0 ≤ |x − x₀| ≤ r, |f(x)/ϕ(x)| ≤ C.

Definition (little-O)

f(x) = o(ϕ(x))   (as x → x₀)

means

lim_{x → x₀} f(x)/ϕ(x) = 0.

Definition (asymptotic to)

f(x) ∼ ϕ(x)   (as x → x₀)

means

lim_{x → x₀} f(x)/ϕ(x) = 1.

Example

exp x = 1 + x + O(x²)   as x → 0

Since

exp x = 1 + x + (exp t)/2 x²

for some t ∈ [0, x], and

|(exp t)/2 x²| / x² = (exp t)/2 ≤ 3/2

provided x ≤ 1.

3. Computational cost

We will use Big-O notation to describe the computational cost of algorithms. Consider the following simple sum

∑_{k=1}^n x_k²

which we might implement as:
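For instance, a minimal Julia sketch matching the step-by-step description below might read:

```julia
# Sum of squares ∑_{k=1}^n x_k², accumulated in a simple loop
function sumsq(x)
    ret = 0.0
    for k = 1:length(x)
        z = x[k]       # memory look-up
        w = z * z      # multiplication
        ret = ret + w  # addition
    end
    ret
end

sumsq([1.0, 2.0, 3.0])  # 14.0
```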

Each step of this algorithm consists of one memory look-up (z = x[k]), one multiplication (w = z*z) and one addition (ret = ret + w). We will ignore the memory look-up in the following discussion. The number of CPU operations per step is therefore 2 (the addition and multiplication). Thus the total number of CPU operations is 2n. But the constant 2 here is misleading: we didn't count the memory look-up, thus it is more sensible to just talk about the asymptotic complexity, that is, the computational cost is O(n).

Now consider a double sum like:

∑_{k=1}^n ∑_{j=1}^k x_j²

which we might implement as:
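A corresponding sketch with two nested loops might read:

```julia
# Double sum ∑_{k=1}^n ∑_{j=1}^k x_j²
function sumsq2(x)
    ret = 0.0
    for k = 1:length(x)
        for j = 1:k            # inner loop: O(k) operations
            ret = ret + x[j]^2
        end
    end
    ret
end

sumsq2([1.0, 2.0])  # 6.0
```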

Now the inner loop is O(1) operations (we don't try to count the precise number), which we do k times for O(k) operations as k → ∞. The outer loop therefore takes

∑_{k=1}^n O(k) = O(∑_{k=1}^n k) = O(n(n+1)/2) = O(n²)

operations.

Week 1: Numbers

YouTube Lectures: Integers, Floating Point Numbers, Rounding, Bounding Rounding Errors

Reference: Overton

In this chapter, we introduce the two's-complement storage for integers and the IEEE Standard for Floating-Point Arithmetic. There are many possible ways of representing real numbers on a computer, and many possible behaviours for operations such as addition and multiplication. Before the 1980s each processor had potentially a different representation for real numbers, as well as different behaviour for operations. The IEEE standard, introduced in 1985, was a means to standardise this across processors so that algorithms would produce consistent and reliable results.

This chapter may seem very low level for a mathematics course but there are several important reasons to understand the behaviour of integers and floating-point numbers:

  1. Integer arithmetic can suddenly start giving wrong negative answers when numbers become large.
  2. Floating-point arithmetic is very precisely defined, and can even be used in rigorous computations as we shall see in the problem sheets. But it is not exact and it's important to understand how errors in computations can accumulate.
  3. Failure to understand floating-point arithmetic can cause catastrophic issues in practice, with the extreme example being the explosion of the Ariane 5 rocket.

In this chapter we discuss the following:

  1. Binary representation: Any real number can be represented in binary, that is, by an infinite sequence of 0s and 1s (bits). We review binary representation.
  2. Integers: There are multiple ways of representing integers on a computer. We discuss the different types of integers and their representation as bits, and how arithmetic operations behave like modular arithmetic. As an advanced topic we discuss BigInt, which uses variable bit length storage.
  3. Floating-point numbers: Real numbers are stored on a computer with a finite number of bits. There are three types of floating-point numbers: normal numbers, subnormal numbers, and special numbers.
  4. Arithmetic: Arithmetic operations in floating-point are exact up to rounding, and the rounding mode can be set. This allows us to bound errors in computations.
  5. High-precision floating-point numbers: As an advanced topic, we discuss how the precision of floating-point arithmetic can be increased arbitrarily using BigFloat.

Before we begin, we load two external packages. SetRounding.jl allows us to set the rounding mode of floating-point arithmetic. ColorBitstring.jl implements functions printbits (and printlnbits) which print the bits (and with a newline) of floating-point numbers in colour.

 

1. Binary representation

Any integer can be represented in binary format, that is, as a sequence of 0s and 1s.

Definition For B₀, …, B_p ∈ {0, 1} denote a non-negative integer in binary format by:

(B_p … B₁B₀)₂ := 2^p B_p + ⋯ + 2B₁ + B₀

For b₁, b₂, … ∈ {0, 1}, denote a non-negative real number in binary format by:

(B_p … B₀.b₁b₂b₃…)₂ := (B_p … B₀)₂ + b₁/2 + b₂/2² + b₃/2³ + ⋯

 

First we show some examples of verifying a number's binary representation:

Example (integer in binary) A simple integer example is 5 = 2² + 2⁰ = (101)₂.

Example (rational in binary) Consider the number 1/3. In decimal recall that:

1/3 = 0.3333… = ∑_{k=1}^∞ 3/10^k

We will see that in binary

1/3 = (0.010101…)₂ = ∑_{k=1}^∞ 1/2^{2k}

Both results can be proven using the geometric series:

∑_{k=0}^∞ z^k = 1/(1 − z)

provided |z| < 1. That is, with z = 1/4 we verify the binary expansion:

∑_{k=1}^∞ (1/4)^k = 1/(1 − 1/4) − 1 = 1/3

A similar argument with z=1/10 shows the decimal case.

2. Integers

On a computer one typically represents integers by a finite number p of bits, with 2^p possible combinations of 0s and 1s. For unsigned integers (non-negative integers) these bits are just the first p binary digits: (B_{p−1} … B₁B₀)₂.

Integers on a computer follow modular arithmetic:

Definition (ring of integers modulo m) Denote the ring

ℤ_m := {0 (mod m), 1 (mod m), …, m − 1 (mod m)}

Integers represented with p bits on a computer actually represent elements of ℤ_{2^p} and integer arithmetic on a computer is equivalent to arithmetic modulo 2^p.

Example (addition of 8-bit unsigned integers) Consider the addition of two 8-bit numbers:

255 + 1 = (11111111)₂ + (00000001)₂ = (100000000)₂ = 256

The result is impossible to store in just 8 bits! It would be too slow for a computer to increase the number of bits, or to throw an error (checks are slow). So instead it treats the integers as elements of ℤ₂₅₆:

255 + 1 (mod 256) = (00000000)₂ (mod 256) = 0 (mod 256)

We can see this in Julia:
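For example, using the unsigned 8-bit type UInt8 (a minimal sketch):

```julia
UInt8(255) + UInt8(1)  # 0x00: the result wraps modulo 2^8
```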

Example (multiplication of 8-bit unsigned integers) Multiplication works similarly: for example,

254 · 2 (mod 256) = 252 (mod 256) = (11111100)₂ (mod 256)

We can see this behaviour in code by printing the bits:

Signed integer

Signed integers use the two's complement convention. The convention is that if the first bit is 1 then the number is negative: the number 2^p − y is interpreted as −y. Thus for p = 8 we are interpreting 2⁷ through 2⁸ − 1 as negative numbers.

Example (converting bits to signed integers) What 8-bit integer has the bits 01001001? Adding the corresponding decimal places we get:

What 8-bit (signed) integer has the bits 11001001? Because the first bit is 1 we know it's a negative number, hence we need to sum the bits but then subtract 2^p:

We can check the results using printbits:
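Base's bitstring shows the same bits as printbits, just without colour, and reinterpret converts bits to a signed integer (a sketch):

```julia
x = Int8(2^6 + 2^3 + 2^0)          # the bits 01001001, i.e. 73
bitstring(x)                        # "01001001"
y = reinterpret(Int8, 0b11001001)   # the same bits with the sign bit set
y                                   # -55, since 201 - 2^8 == -55
```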

Arithmetic works precisely the same for signed and unsigned integers.

Example (addition of 8-bit integers) Consider (-1) + 1 in 8-bit arithmetic. The number −1 has the same bits as 2⁸ − 1 = 255. Thus this is equivalent to the previous question and we get the correct result of 0. In other words:

−1 + 1 (mod 2^p) = 2^p − 1 + 1 (mod 2^p) = 2^p (mod 2^p) = 0 (mod 2^p)

Example (multiplication of 8-bit integers) Consider (-2) * 2. The number −2 has the same bits as 256 − 2 = 254 and −4 has the same bits as 256 − 4 = 252, and hence from the previous example we get the correct result of -4. In other words:

(−2) · 2 (mod 2^p) = (2^p − 2) · 2 (mod 2^p) = 2^{p+1} − 4 (mod 2^p) = −4 (mod 2^p)

 

 

Example (overflow) We can find the largest and smallest instances of a type using typemax and typemin:

As explained, due to modular arithmetic, when we add 1 to the largest 8-bit integer we get the smallest:
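For example:

```julia
typemax(Int8), typemin(Int8)  # (127, -128)
typemax(Int8) + Int8(1)       # -128: wraps around to the smallest value
```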

This behaviour is often not desired and is known as overflow, and one must be wary of using integers close to their largest value.

Variable bit representation (advanced)

An alternative representation for integers uses a variable number of bits, with the advantage of avoiding overflow but with the disadvantage of a substantial speed penalty. In Julia these are BigInts, which we can create by calling big on an integer:

Note in this case addition automatically promotes an Int64 to a BigInt. We can create very large numbers using BigInt:
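For example:

```julia
x = big(10)    # a BigInt
x^100          # 10^100 computed exactly, with no overflow
typeof(x + 1)  # BigInt: the Int64 is promoted automatically
```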

Note the number of bits is not fixed: the larger the number, the more bits required to represent it. So while overflow is impossible, it is possible to run out of memory if a number is astronomically large: go ahead and try x^x (at your own risk).

Division

In addition to +, -, and * we have integer division ÷, which rounds towards zero:

Standard division / (or \, which divides the right argument by the left) creates a floating-point number, which will be discussed shortly:

We can also create rational numbers using //:

Rational arithmetic often leads to overflow so it is often best to combine big with rationals:
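A sketch of the four kinds of division:

```julia
5 ÷ 2        # 2: integer division
2 \ 5        # 2.5: the same as 5 / 2
5 / 2        # 2.5: a Float64
1 // 3       # 1//3: a Rational{Int64}
big(1) // 3  # a BigInt-backed rational, avoiding overflow
```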

3. Floating-point numbers

Floating-point numbers are a subset of real numbers that are representable using a fixed number of bits.

Definition (floating-point numbers) Given integers σ (the "exponential shift"), Q (the number of exponent bits) and S (the precision), define the set of floating-point numbers by dividing into normal, sub-normal, and special number subsets:

F_{σ,Q,S} := F_{σ,Q,S}^normal ∪ F_{σ,Q,S}^sub ∪ F^special.

The normal numbers F_{σ,Q,S}^normal ⊂ ℝ are defined by

F_{σ,Q,S}^normal := {±2^{q−σ} × (1.b₁b₂b₃…b_S)₂ : 1 ≤ q < 2^Q − 1}.

The sub-normal numbers F_{σ,Q,S}^sub ⊂ ℝ are defined as

F_{σ,Q,S}^sub := {±2^{1−σ} × (0.b₁b₂b₃…b_S)₂}.

The special numbers F^special are defined later.

Note this set of real numbers has no nice algebraic structure: it is not closed under addition, subtraction, etc. We will therefore need to define approximate versions of algebraic operations later.

Floating-point numbers are stored in 1 + Q + S total bits, in the format

s q_{Q−1} … q₀ b₁ … b_S

The first bit (s) is the sign bit: 0 means positive and 1 means negative. The bits q_{Q−1} … q₀ are the exponent bits: they are the binary digits of the unsigned integer q:

q = (q_{Q−1} … q₀)₂.

Finally, the bits b₁ … b_S are the significand bits. If 1 ≤ q < 2^Q − 1 then the bits represent the normal number

x = ±2^{q−σ} × (1.b₁b₂b₃…b_S)₂.

If q = 0 (i.e. all exponent bits are 0) then the bits represent the sub-normal number

x = ±2^{1−σ} × (0.b₁b₂b₃…b_S)₂.

If q = 2^Q − 1 (i.e. all exponent bits are 1) then the bits represent a special number, discussed later.

IEEE floating-point numbers

Definition (IEEE floating-point numbers) IEEE has 3 standard floating-point formats: 16-bit (half precision), 32-bit (single precision) and 64-bit (double precision) defined by:

F₁₆ := F_{15,5,10}
F₃₂ := F_{127,8,23}
F₆₄ := F_{1023,11,52}

In Julia these correspond to 3 different floating-point types:

  1. Float64 is a type representing double precision (F64). We can create a Float64 by including a decimal point when writing the number: 1.0 is a Float64. Float64 is the default format for scientific computing (on the Floating-Point Unit, FPU).
  2. Float32 is a type representing single precision (F₃₂). We can create a Float32 by including an f0 when writing the number: 1f0 is a Float32. Float32 is generally the default format for graphics (on the Graphics Processing Unit, GPU), as the difference between 32 bits and 64 bits is indistinguishable to the eye in visualisation, and more data can be fit into a GPU's limited memory.
  3. Float16 is a type representing half-precision (F16). It is important in machine learning where one wants to maximise the amount of data and high accuracy is not necessarily helpful.

Example (rational in Float32) How is the number 1/3 stored in Float32? Recall that

1/3 = (0.010101…)₂ = 2⁻² (1.0101…)₂ = 2^{125−127} (1.0101…)₂

and since 125 = (1111101)₂ the exponent bits are 01111101. For the significand we round the last bit to the nearest element of F₃₂ (this is explained in detail in the section on rounding), so we have

1.01010101010101010101010101… ↦ 1.01010101010101010101011 ∈ F₃₂

and the significand bits are 01010101010101010101011. Thus the Float32 bits for 1/3 are:
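We can verify this with Base's bitstring (printbits would show the same bits in colour):

```julia
bitstring(1f0/3)
# "00111110101010101010101010101011"
# = sign 0, exponent 01111101, significand 01010101010101010101011
```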

For sub-normal numbers, the simplest example is zero, which has q=0 and all significand bits zero:

Unlike integers, we also have a negative zero:

This is treated as identical to 0.0 (except for degenerate operations as explained in special numbers).

 

Special normal numbers

When dealing with normal numbers there are some important constants that we will use to bound errors.

Definition (machine epsilon/smallest positive normal number/largest normal number) Machine epsilon is denoted

ϵ_{m,S} := 2^{−S}.

When S is implied by context we use the notation ϵ_m. The smallest positive normal number occurs when q = 1 and all b_k are zero:

min |F_{σ,Q,S}^normal| = 2^{1−σ}

where |A| := {|x| : x ∈ A}. The largest (positive) normal number is

max F_{σ,Q,S}^normal = 2^{2^Q−2−σ} (1.11…1)₂ = 2^{2^Q−2−σ} (2 − ϵ_m)

We confirm the simple bit representations:

For a given floating-point type, we can find these constants using the following functions:
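Namely eps, floatmin and floatmax:

```julia
eps(Float32)       # machine epsilon ϵ_m = 2^(-23)
floatmin(Float32)  # smallest positive normal number, 2^(-126)
floatmax(Float32)  # largest normal number, 2^127 * (2 - 2^(-23))
```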

Example (creating a sub-normal number) If we divide the smallest normal number by two, we get a subnormal number:
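For example (a sketch):

```julia
mn = floatmin(Float32)  # smallest positive normal number
issubnormal(mn/2)       # true
bitstring(mn/2)         # exponent bits all zero, leading significand bit 1
```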

Can you explain the bits?

 

Special numbers

The special numbers extend the real line by adding ±∞ but also a notion of "not-a-number".

Definition (not a number) Let NaN represent "not a number" and define

F^special := {∞, −∞, NaN}

Whenever the exponent bits q of a floating-point number are all 1 then the number represents an element of F^special. If all b_k = 0, then the number represents either +∞ or −∞ (depending on the sign bit), called Inf and -Inf for 64-bit floating-point numbers (or Inf16, Inf32 for 16-bit and 32-bit, respectively):

All other special floating-point numbers represent NaN. One particular representation of NaN is denoted by NaN for 64-bit floating-point numbers (or NaN16, NaN32 for 16-bit and 32-bit, respectively):

These are needed for undefined algebraic operations such as:

Example (many NaNs) What happens if we change some other bk to be nonzero? We can create bits as a string and see:
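For example (a sketch building the bits as a string and reinterpreting them):

```julia
bits = "0" * "1"^11 * "0"^51 * "1"  # all-ones exponent, nonzero significand
x = reinterpret(Float64, parse(UInt64, bits; base=2))
isnan(x)                        # true
bitstring(x) == bitstring(NaN)  # false: a NaN with different bits
```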

Thus, there is more than one NaN on a computer.

4. Arithmetic

Arithmetic operations on floating-point numbers are exact up to rounding. There are three basic rounding strategies: round up/down/nearest. Mathematically we introduce a function to capture the notion of rounding:

Definition (rounding) fl_{σ,Q,S}^up : ℝ → F_{σ,Q,S} denotes the function that rounds a real number up to the nearest floating-point number that is greater than or equal to it. fl_{σ,Q,S}^down : ℝ → F_{σ,Q,S} denotes the function that rounds a real number down to the nearest floating-point number that is less than or equal to it. fl_{σ,Q,S}^nearest : ℝ → F_{σ,Q,S} denotes the function that rounds a real number to the nearest floating-point number. In case of a tie, it returns the floating-point number whose least significant bit is equal to zero. We use the notation fl when σ, Q, S and the rounding mode are implied by context, with fl^nearest being the default rounding mode.

 

In Julia, the rounding mode is specified by tags RoundUp, RoundDown, and RoundNearest. (There are also more exotic rounding strategies RoundToZero, RoundNearestTiesAway and RoundNearestTiesUp that we won't use.)

 

WARNING (rounding performance, advanced) These rounding modes are part of the FPU instruction set so will be (roughly) equally fast as the default, RoundNearest. Unfortunately, changing the rounding mode is expensive, and is not thread-safe.

 

Let's try rounding a Float64 to a Float32.
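For example:

```julia
x = 1/3                # a Float64
Float32(x)             # rounds to the nearest Float32 by default
Float32(x, RoundDown)  # rounds towards -∞
Float32(x, RoundUp)    # rounds towards +∞
```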

The default rounding mode can be changed:

Or alternatively we can change the rounding mode for a chunk of code using setrounding. The following computes upper and lower bounds for /:

WARNING (compiled constants, advanced): Why did we first create a variable x instead of typing 1f0/3? This is due to a very subtle issue where the compiler is too clever for its own good: it recognises 1f0/3 can be computed at compile time, but fails to recognise that the rounding mode was changed.

In IEEE arithmetic, the arithmetic operations +, -, *, / are defined by the property that they are exact up to rounding. Mathematically we denote these operations as follows:

x ⊕ y := fl(x + y)
x ⊖ y := fl(x − y)
x ⊗ y := fl(x y)
x ⊘ y := fl(x / y)

Note also that ^ and sqrt are similarly exact up to rounding.

Example (decimal is not exact) 1.1+0.1 gives a different result than 1.2:
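For example:

```julia
1.1 + 0.1         # 1.2000000000000002
1.1 + 0.1 == 1.2  # false
```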

This is because fl(1.1) ≠ 1 + 1/10, but rather:

fl(1.1) = 1 + 2⁻⁴ + 2⁻⁵ + 2⁻⁸ + 2⁻⁹ + ⋯ + 2⁻⁴⁸ + 2⁻⁴⁹ + 2⁻⁵¹

WARNING (non-associative) These operations are not associative! E.g. (x ⊕ y) ⊕ z is not necessarily equal to x ⊕ (y ⊕ z). Commutativity is preserved, at least. Here is a surprising example of non-associativity:
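For example:

```julia
(1.1 + 1.2) + 1.3 == 1.1 + (1.2 + 1.3)  # false
```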

Can you explain this in terms of bits?

Bounding errors in floating point arithmetic

Before we discuss bounds on errors, we need to talk about the two notions of error:

Definition (absolute/relative error) If x̃ = x + δ_a = x(1 + δ_r) then |δ_a| is called the absolute error and |δ_r| is called the relative error in approximating x by x̃.

We can bound the error of basic arithmetic operations in terms of machine epsilon, provided a real number is close to a normal number:

Definition (normalised range) The normalised range Nσ,Q,SR is the subset of real numbers that lies between the smallest and largest normal floating-point number:

N_{σ,Q,S} := {x : min |F_{σ,Q,S}^normal| ≤ |x| ≤ max F_{σ,Q,S}^normal}

When σ,Q,S are implied by context we use the notation N.

We can use machine epsilon to determine bounds on rounding:

Proposition (rounding arithmetic) If x ∈ N then

fl^mode(x) = x(1 + δ_x^mode)

where the relative error satisfies

|δ_x^nearest| ≤ ϵ_m/2,   |δ_x^{up/down}| < ϵ_m.

This immediately implies relative error bounds on all IEEE arithmetic operations, e.g., if x + y ∈ N then we have

x ⊕ y = (x + y)(1 + δ₁)

where (assuming the default nearest rounding) |δ₁| ≤ ϵ_m/2.

Example (bounding a simple computation) We show how to bound the error in computing

(1.1+1.2)+1.3

using floating-point arithmetic. First note that 1.1 on a computer is in fact fl(1.1). Thus this computation becomes

(fl(1.1) ⊕ fl(1.2)) ⊕ fl(1.3)

First we find

(fl(1.1) ⊕ fl(1.2)) = (1.1(1 + δ₁) + 1.2(1 + δ₂))(1 + δ₃) = 2.3 + 1.1δ₁ + 1.2δ₂ + 2.3δ₃ + 1.1δ₁δ₃ + 1.2δ₂δ₃ = 2.3 + δ₄

where (note δ₁δ₃ and δ₂δ₃ are tiny so we just round up our bound to the nearest decimal)

|δ₄| ≤ 2.3ϵ_m

Thus the computation becomes

((2.3 + δ₄) + 1.3(1 + δ₅))(1 + δ₆) = 3.6 + δ₄ + 1.3δ₅ + 3.6δ₆ + δ₄δ₆ + 1.3δ₅δ₆ = 3.6 + δ₇

where the absolute error is

|δ₇| ≤ 4.8ϵ_m

Indeed, this bound is bigger than the observed error:
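A quick check, using eps() for ϵ_m:

```julia
err = abs(((1.1 + 1.2) + 1.3) - 3.6)  # observed error
err ≤ 4.8 * eps()                     # true: within the bound above
```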

Arithmetic and special numbers

Arithmetic works differently on Inf and NaN and for undefined operations. In particular we have:
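For example:

```julia
1/0.0, -1/0.0  # (Inf, -Inf)
Inf + Inf      # Inf
Inf - Inf      # NaN
0.0 * Inf      # NaN
NaN == NaN     # false: NaN compares unequal to everything, itself included
```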

Special functions (advanced)

Other special functions like cos, sin, exp, etc. are not part of the IEEE standard. Instead, they are implemented by composing the basic arithmetic operations, which accumulate errors. Fortunately many are designed to have relative accuracy, that is, s = sin(x) (the Julia implementation of sin x) satisfies

s = (sin x)(1 + δ)

where |δ| < c ϵ_m for a reasonably small c > 0, provided that x ∈ F^normal. Note these special functions are written in (advanced) Julia code, for example, sin.

WARNING (sin(fl(x)) is not always close to sin(x)) This is possibly a misleading statement when one thinks of x as a real number. Consider x = π so that sin x = 0. However, fl(π) ≠ π, and so sin(fl(π)) is not zero. Thus we only have relative accuracy compared to the floating point approximation:
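For example:

```julia
sin(Float64(π))  # roughly 1.2e-16 rather than 0, since Float64(π) ≠ π
```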

Another issue is when x is very large:

But if we instead carry out the same computation using BigFloat we get a completely different answer that even has the wrong sign!

This is because we commit an error on the order of roughly

2 · 10¹⁰⁰ ϵ_m ≈ 4.44 × 10⁸⁴

when we round 2 · 10¹⁰⁰ to the nearest float.

Example (polynomial near root) For general functions we do not generally have relative accuracy. For example, consider the simple polynomial 1 + 4x + x² which has a root at √3 − 2. But evaluating it near the root we see a loss of relative accuracy:
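A sketch of the evaluation (variable names are illustrative):

```julia
x = sqrt(3) - 2   # a floating point approximation of the root
1 + 4x + x^2      # tiny, but not zero in general
```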

We can see this in the error bound (note that 4x is exact for floating point numbers and adding 1 is exact for this particular x):

(x ⊗ x ⊕ 4x) + 1 = (x²(1 + δ₁) + 4x)(1 + δ₂) + 1 = x² + 4x + 1 + δ₁x² + (4x + x²)δ₂ + x²δ₁δ₂

Using a simple bound |x|<1 we get a (pessimistic) bound on the absolute error of 3ϵm. Here f(x) itself is less than 2ϵm so this does not imply relative accuracy. (Of course, a bad upper bound is not the same as a proof of inaccuracy, but here we observe the inaccuracy in practice.)

 

 

5. High-precision floating-point numbers (advanced)

It is possible to set the precision of a floating-point number using the BigFloat type, which results from the usage of big when the result is not an integer. For example, here is an approximation of 1/3 accurate to 77 decimal digits:
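For example:

```julia
x = big(1)/3  # a BigFloat
precision(x)  # 256 bits by default
```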

Note we can set the rounding mode as in Float64, e.g., this gives (rigorous) bounds on 1/3:

We can also increase the precision, e.g., this finds bounds on 1/3 accurate to more than 1000 decimal places:
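A sketch using setprecision (4000 bits comfortably exceeds the roughly 3322 bits needed for 1000 decimal places):

```julia
lower, upper = setprecision(4_000) do
    a = setrounding(BigFloat, RoundDown) do
        big(1)/3
    end
    b = setrounding(BigFloat, RoundUp) do
        big(1)/3
    end
    (a, b)
end
lower < 1//3 < upper  # true: a rigorous enclosure of 1/3
```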

In the problem sheet we shall see how this can be used to rigorously bound e, accurate to 1000 digits.


Week 2: Differentiation

YouTube Lectures: Finite Differences Dual Numbers

We now get to our first computational problem: given a function, how can we approximate its derivative at a point? Before we begin, we must be clear what a "function" is. Consider three possible scenarios:

  1. Black-box function: Consider a floating-point valued function f^FP : D_F → F where D_F ⊂ F := F_{σ,Q,S} (e.g., we are given a double precision function that takes in a Float64 and returns another Float64) which we only know pointwise. This is the situation if we have a function that relies on a compiled C library, which composes floating point arithmetic operations. Since F is a discrete set such an f^FP cannot be differentiable in a rigorous way, therefore we need to assume that f^FP approximates a differentiable function f with controlled error in order to state anything precise.
  2. Generic function: Consider a function given by a formula (or, equivalently, a piece of code) that we can evaluate on arbitrary types, including custom types that we create. An example is a polynomial: p(x) = p₀ + p₁x + ⋯ + p_n x^n which can be evaluated for x in the reals, complexes, or any other ring. More generally, if we have a function defined in Julia that does not call any C libraries it can be evaluated on different types. For analysis we typically consider both a differentiable function f : D → ℝ for D ⊂ ℝ, which is what one would have if we could evaluate the function exactly using real arithmetic, and f^FP : D_F → F, which is what we actually compute when evaluating the function using floating point arithmetic.
  3. Graph function: The function is built by composing different basic "kernels" with known differentiability properties. We won't consider this situation in this module, though it is the model used by Python machine learning toolboxes like PyTorch and TensorFlow.

We discuss the following techniques:

  1. Finite-differences: Use the definition of a derivative that one learns in calculus to approximate its value. Unfortunately, the round-off errors of floating point arithmetic typically limit its accuracy.
  2. Dual numbers (forward-mode automatic differentiation): Define a special type that when applied to a function computes its derivative. Mathematically, this uses dual numbers, which are analogous to complex numbers.

Note there are other techniques for differentiation that we don't discuss:

  1. Symbolic differentiation: A tree is built representing a formula which is differentiated using the product and chain rule.
  2. Adjoints and back-propagation (reverse-mode automatic differentiation): This is similar to symbolic differentiation but automated, to build up a tape of operations that tracks interdependencies. It's outside the scope of this module but is computationally preferred for computing gradients of large dimensional functions which is critical in machine learning.
  3. Interpolation and differentiation: We can also differentiate functions globally, that is, in an interval instead of only a single point, which will be discussed later in the module.

1. Finite-differences

The definition

f′(x) = lim_{h → 0} (f(x + h) − f(x))/h

tells us that

f′(x) ≈ (f(x + h) − f(x))/h

provided that h is sufficiently small.

It's important to note that this approximation uses only the black-box notion of a function, but to obtain bounds we need more.

If we know a bound on f′′(x) then Taylor's theorem tells us a precise bound:

Proposition The error in approximating the derivative using finite differences is

|f′(x) − (f(x + h) − f(x))/h| ≤ (M/2) h

where M = sup_{x ≤ t ≤ x+h} |f′′(t)|.

Proof Follows immediately from Taylor's theorem:

f(x + h) = f(x) + f′(x) h + (f′′(t)/2) h²

for some x ≤ t ≤ x + h.

◼️

 

There are also alternative versions of finite differences. Leftside finite-differences:

f′(x) ≈ (f(x) − f(x − h))/h

and central differences:

f′(x) ≈ (f(x + h) − f(x − h))/(2h)

Composing these approximations is useful for higher-order derivatives as we discuss in the problem sheet.

Note this is assuming real arithmetic; the answer is drastically different with floating point arithmetic.

Does finite-differences work with floating point arithmetic?

Let's try differentiating two simple polynomials f(x) = 1 + x + x² and g(x) = 1 + x/3 + x² by applying the finite-difference approximation to their floating point implementations f^FP and g^FP:
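A sketch with an illustrative step size h:

```julia
f = x -> 1 + x + x^2
g = x -> 1 + x/3 + x^2
h = 2.0^(-20)
(f(h) - f(0))/h, (g(h) - g(0))/h  # approximately 1 and 1/3
```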

Both seem to roughly approximate the true derivatives (1 and 1/3). We can do a plot to see how fast the error goes down as we let h become small.

In the case of f it is a success: we approximate the true derivative exactly provided we take h = 2^{−n} for 26 < n ≤ 52. But for g it is a huge failure: the approximation starts to converge, but then diverges exponentially fast, before levelling off!

It is clear that f is extremely special. Most functions will behave like g, and had we not taken h to be a power of two we would also see divergence when differentiating f:

For these two simple examples, we can understand why we see very different behaviour.

Example (convergence(?) of finite difference) Consider differentiating f(x) = 1 + x + x² at 0 with h = 2^{−n}. We consider 3 different cases with different behaviour, where S is the number of significand bits:

  1. 0 ≤ n ≤ S/2
  2. S/2 < n ≤ S
  3. S < n

Note that f^FP(0) = f(0) = 1. Thus we wish to understand the error in approximating f′(0) = 1 by

(f^FP(h) ⊖ 1) ⊘ h   where   f^FP(x) = 1 ⊕ x ⊕ x ⊗ x.

Case 1 (0 ≤ n ≤ S/2): note that f^FP(h) = f(h) = 1 + 2^{−n} + 2^{−2n} as each computation is precisely a floating point number (hence no rounding). We can see this in half-precision: with n = 3 we have a 1 in the 3rd and 6th binary places after the point:
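For example:

```julia
h = Float16(0.125)        # 2^(-3) in half-precision
x = Float16(1) + h + h^2  # 1 + 2^(-3) + 2^(-6), stored exactly
bitstring(x)              # "0011110010010000": significand bits 0010010000
```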

Subtracting 1 and dividing by h will also be exact, hence we get

(f^FP(h) ⊖ 1) ⊘ h = 1 + 2^{−n}

which shows exponential convergence.

Case 2 (S/2 < n ≤ S): Now we have (using round-to-nearest)

f^FP(h) = (1 ⊕ 2^{−n}) ⊕ 2^{−2n} = 1 + 2^{−n}

Then

(f^FP(h) ⊖ 1) ⊘ h = 1 = f′(0)

We have actually performed better than true real arithmetic and converged without a limit!

Case 3 (n > S): If we take n too large then 1 ⊕ h = 1, hence f^FP(h) = 1 and

(f^FP(h) ⊖ 1) ⊘ h = 0 ≠ f′(0)

Example (divergence of finite difference) Consider differentiating g(x) = 1 + x/3 + x² at 0 with h = 2^{−n}; assume n is even for simplicity and consider half-precision with S = 10. Note that g^FP(0) = g(0) = 1. Recall

h ⊘ 3 = 2^{−n−2} (1.0101010101)₂

Note we lose two bits each time in the computation of 1 ⊕ (h ⊘ 3):

It follows if S/2 < n < S that

1 ⊕ (h ⊘ 3) = 1 + h/3 − 2^{−10}/3

Therefore

(g^FP(h) ⊖ 1) ⊘ h = 1/3 − 2^{n−10}/3

Thus the error grows exponentially with n.

If n ≥ S then 1 ⊕ (h ⊘ 3) = 1 and we have

(g^FP(h) ⊖ 1) ⊘ h = 0

Bounding the error

We can bound the error using the bounds on floating point arithmetic.

Theorem (finite-difference error bound) Let f be twice-differentiable in a neighbourhood of x and assume that f^FP(x) = f(x) + δ_x^f has uniform absolute accuracy in that neighbourhood, that is:

|δ_x^f| ≤ c ϵ_m

for a fixed constant c. Assume for simplicity h = 2^{−n} where n ≤ S and |x| ≤ 1. Then the finite-difference approximation satisfies

(f^FP(x + h) ⊖ f^FP(x)) ⊘ h = f′(x) + δ_{x,h}^FD

where

|δ_{x,h}^FD| ≤ (|f′(x)|/2) ϵ_m + M h + 4c ϵ_m / h

for M = sup_{x ≤ t ≤ x+h} |f′′(t)|.

Proof

We have (noting by our assumptions that x ⊕ h = x + h and that dividing by h only changes the exponent so is exact)

(f^FP(x + h) ⊖ f^FP(x)) ⊘ h = ((f(x + h) + δ_{x+h}^f − f(x) − δ_x^f)/h)(1 + δ₁)
= ((f(x + h) − f(x))/h)(1 + δ₁) + ((δ_{x+h}^f − δ_x^f)/h)(1 + δ₁)

where |δ₁| ≤ ϵ_m/2. Applying Taylor's theorem we get

(f^FP(x + h) ⊖ f^FP(x)) ⊘ h = f′(x) + f′(x)δ₁ + (f′′(t)/2) h (1 + δ₁) + ((δ_{x+h}^f − δ_x^f)/h)(1 + δ₁), where the last three terms together form δ_{x,h}^FD.

The bound then follows, using the very pessimistic bound |1 + δ₁| ≤ 2.

◼️

The three terms of this bound tell us a story: the first term is a fixed (small) error, the second term tends to zero as h → 0, while the last term grows like ϵ_m/h as h → 0. Thus we observe convergence while the second term dominates, until the last term takes over. Of course, a bad upper bound is not the same as a proof that something grows, but it is a good indication of what happens in general and suffices to motivate the following heuristic to balance the two sources of error:

Heuristic (finite-difference with floating-point step) Choose h proportional to √ϵ_m in finite-differences.

In the case of double precision √ϵ_m ≈ 1.5 × 10⁻⁸, which is close to when the observed error begins to increase in our examples.

 

Remark: While finite differences is of debatable utility for computing derivatives, it is extremely effective in building methods for solving differential equations, as we shall see later. It is also very useful as a "sanity check" if one wants something to compare with for other numerical methods for differentiation.

2. Dual numbers (Forward-mode automatic differentiation)

Automatic differentiation consists of applying functions to special types that determine the derivatives. Here we do so via dual numbers.

Definition (Dual numbers) Dual numbers 𝔻 are a commutative ring over the reals generated by 1 and ϵ such that ϵ² = 0. Dual numbers are typically written as a + bϵ where a and b are real.

This is very much analogous to complex numbers, which are a field generated by 1 and i such that i² = −1. Compare multiplication of each number type:

$$(a + b{\rm i})(c + d{\rm i}) = ac + (bc + ad){\rm i} + bd\,{\rm i}^2 = ac - bd + (bc + ad){\rm i}$$
$$(a + b\epsilon)(c + d\epsilon) = ac + (bc + ad)\epsilon + bd\,\epsilon^2 = ac + (bc + ad)\epsilon$$

And just as we view $\mathbb{R} \subset \mathbb{C}$ by equating $a \in \mathbb{R}$ with $a + 0{\rm i} \in \mathbb{C}$, we can view $\mathbb{R} \subset \mathbb{D}$ by equating $a \in \mathbb{R}$ with $a + 0\epsilon \in \mathbb{D}$.

 

Connection with differentiation

Applying a polynomial to a dual number a+bϵ tells us the derivative at a:

Theorem (polynomials on dual numbers) Suppose p is a polynomial. Then

$$p(a + b\epsilon) = p(a) + b\,p'(a)\,\epsilon$$

Proof

It suffices to consider p(x)=xn for n1 as other polynomials follow from linearity. We proceed by induction: The case n=1 is trivial. For n>1 we have

$$(a + b\epsilon)^n = (a + b\epsilon)(a + b\epsilon)^{n-1} = (a + b\epsilon)(a^{n-1} + (n-1) b a^{n-2} \epsilon) = a^n + b n a^{n-1} \epsilon.$$

We can extend real-valued differentiable functions to dual numbers in a similar manner. First, consider a standard function with a Taylor series (e.g. cos, sin, exp, etc.)

$$f(x) = \sum_{k=0}^\infty f_k x^k$$

so that a is inside the radius of convergence. This leads naturally to a definition on dual numbers:

$$f(a + b\epsilon) = \sum_{k=0}^\infty f_k (a + b\epsilon)^k = \sum_{k=0}^\infty f_k (a^k + k a^{k-1} b \epsilon) = \sum_{k=0}^\infty f_k a^k + \sum_{k=0}^\infty f_k k a^{k-1} b \epsilon = f(a) + b f'(a) \epsilon$$

More generally, given a differentiable function we can extend it to dual numbers:

Definition (dual extension) Suppose a real-valued function f is differentiable at a. If

$$f(a + b\epsilon) = f(a) + b f'(a) \epsilon$$

then we say that it is a dual extension at a.

Thus, for basic functions we have natural extensions:

$$\begin{aligned}
\exp(a + b\epsilon) &:= \exp(a) + b \exp(a) \epsilon \\
\sin(a + b\epsilon) &:= \sin(a) + b \cos(a) \epsilon \\
\cos(a + b\epsilon) &:= \cos(a) - b \sin(a) \epsilon \\
\log(a + b\epsilon) &:= \log(a) + {b \over a} \epsilon \\
\sqrt{a + b\epsilon} &:= \sqrt{a} + {b \over 2\sqrt{a}} \epsilon \\
|a + b\epsilon| &:= |a| + b\,{\rm sign}(a)\,\epsilon
\end{aligned}$$

provided the function is differentiable at a. Note the last example does not have a convergent Taylor series (at 0) but we can still extend it where it is differentiable.

Going further, we can add, multiply, and compose such functions:

Lemma (product and chain rule) If $f$ is a dual extension at $g(a)$ and $g$ is a dual extension at $a$, then $q(x) := f(g(x))$ is a dual extension at $a$. If $f$ and $g$ are dual extensions at $a$ then $r(x) := f(x) g(x)$ is also a dual extension at $a$. In other words:

$$q(a + b\epsilon) = q(a) + b q'(a) \epsilon$$
$$r(a + b\epsilon) = r(a) + b r'(a) \epsilon$$

Proof For q it follows immediately:

$$q(a + b\epsilon) = f(g(a + b\epsilon)) = f(g(a) + b g'(a) \epsilon) = f(g(a)) + b g'(a) f'(g(a)) \epsilon = q(a) + b q'(a) \epsilon.$$

For r we have

$$r(a + b\epsilon) = f(a + b\epsilon) g(a + b\epsilon) = (f(a) + b f'(a) \epsilon)(g(a) + b g'(a) \epsilon) = f(a) g(a) + b (f'(a) g(a) + f(a) g'(a)) \epsilon = r(a) + b r'(a) \epsilon.$$

A simple corollary is that any function defined in terms of addition, multiplication, composition, etc. of functions that are dual with differentiation will be differentiable via dual numbers.

Example (differentiating non-polynomial)

Consider $f(x) = \exp(x^2 + {\rm e}^x)$ by evaluating on the duals:

$$f(1 + \epsilon) = \exp(1 + 2\epsilon + {\rm e} + {\rm e}\epsilon) = \exp(1 + {\rm e}) + \exp(1 + {\rm e})(2 + {\rm e})\epsilon$$

and therefore we deduce that

$$f'(1) = \exp(1 + {\rm e})(2 + {\rm e}).$$

Implementation as a special type

We now consider a simple implementation of dual numbers that works on general polynomials:
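As a sketch of such an implementation (shown here in Python; the type name Dual and helper derivative are illustrative), operator overloading is all that is needed for polynomials:

```python
# Minimal dual-number type: a + b*eps with eps^2 = 0. Overloading
# + and * makes any polynomial compute its own derivative.
class Dual:
    def __init__(self, a, b=0.0):
        self.a = a  # value part
        self.b = b  # derivative part

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (bc + ad) eps
        return Dual(self.a * other.a, self.b * other.a + self.a * other.b)
    __rmul__ = __mul__

def derivative(f, x):
    # evaluate at x + 1*eps and read off the eps component
    return f(Dual(x, 1.0)).b

p = lambda x: 1 + 2 * x + 3 * x * x   # p'(x) = 2 + 6x
print(derivative(p, 2.0))              # 14.0
```

Because only ring operations are used, the same mechanism works unchanged on any algorithm built from additions and multiplications.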

We can also try it on the two polynomials as above:

The first example exactly computes the derivative, and the second example is exact up to the last bit rounding! It also works for higher order polynomials:

It is indeed "accurate to (roughly) 16-digits", the best we can hope for using floating point.

We can use this in "algorithms" as well as simple polynomials. Consider the polynomial 1++xn:

This matches exactly the "true" (up to rounding) derivative:

Finally, we can try the more complicated example:

What makes dual numbers so effective is that, unlike finite differences, they are not prone to disastrous growth due to round-off errors.


Week 3: Structured Matrices

YouTube Lectures: Structured Matrices Permutation Matrices Orthogonal Matrices

We have seen how algebraic operations (+, -, *, /) are well-defined for floating point numbers. Now we see how this allows us to do (approximate) linear algebra operations on structured matrices. That is, we consider the following structures:

  1. Dense: This can be considered unstructured, where we need to store all entries in a vector or matrix. Matrix multiplication reduces directly to standard algebraic operations. Solving linear systems with dense matrices will be discussed later.
  2. Triangular: If a matrix is upper or lower triangular, we can immediately invert using back-substitution. In practice we store a dense matrix and ignore the upper/lower entries.
  3. Banded: If a matrix is zero apart from entries a fixed distance from the diagonal it is called banded and this allows for more efficient algorithms. We discuss diagonal, tridiagonal and bidiagonal matrices.
  4. Permutation: A permutation matrix permutes the rows of a vector.
  5. Orthogonal: An orthogonal matrix $Q$ satisfies $Q^\top Q = I$, in other words, they are very easy to invert. We discuss the special cases of simple rotations and reflections.

1. Dense vectors and matrices

A Vector of a primitive type (like Int or Float64) is stored consecutively in memory. E.g. if we have a Vector{Int8} of length n then it is stored as 8n bits (n bytes) in a row. A Matrix is stored consecutively in memory, going down column-by-column. That is,

Is actually stored equivalently to a length 6 vector:

This is known as column-major format.

Remark: Note that transposing A is done lazily and so A' stores the entries by row. That is, A' is stored in row-major format.

Matrix-vector multiplication works as expected:

Note there are two ways this can be implemented: either the traditional definition, going row-by-row:

$$\begin{bmatrix} \sum_{j=1}^n a_{1,j} x_j \\ \vdots \\ \sum_{j=1}^n a_{m,j} x_j \end{bmatrix}$$

or going column-by-column:

$$x_1 \mathbf{a}_1 + \cdots + x_n \mathbf{a}_n$$

It is easy to implement either version of matrix-multiplication in terms of the algebraic operations we have learned, in this case just using integer arithmetic:
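As a sketch of the two loop orderings (Python, with illustrative names mul_rows and mul_cols; Python lists are row-major, so this shows only the structure of the loops, not the memory-access effect the lecture times in Julia):

```python
# Matrix-vector product two ways: row-by-row (dot products) and
# column-by-column (accumulating x[j] times the j-th column).
def mul_rows(A, x):
    m, n = len(A), len(A[0])
    return [sum(A[k][j] * x[j] for j in range(n)) for k in range(m)]

def mul_cols(A, x):
    m, n = len(A), len(A[0])
    c = [0] * m
    for j in range(n):            # outer loop over columns
        for k in range(m):
            c[k] += A[k][j] * x[j]
    return c

A = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
print(mul_rows(A, x), mul_cols(A, x))  # [3, 7, 11] [3, 7, 11]
```

Both versions cost $O(mn)$ operations; in a column-major language the second ordering accesses memory in order, which is why it is faster there.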

Either implementation will be O(mn) operations. However, the implementation mul accesses the entries of A going down the column, which happens to be significantly faster than mul_rows, due to accessing memory of A in order. We can see this by measuring the time it takes using @btime:

Here ms means milliseconds (0.001 = 10^(-3) seconds) and μs means microseconds (0.000001 = 10^(-6) seconds). So we observe that mul is roughly 3x faster than mul_rows, while the optimised * is roughly 5x faster than mul.

Remark: (advanced) For floating point types, A*x is implemented in BLAS which is generally multi-threaded and is not identical to mul(A,x), that is, some inputs will differ in how the computations are rounded.

Note that the rules of arithmetic apply here: matrix multiplication with floats will incur round-off error (the precise details of which are subject to the implementation):

And integer arithmetic will be prone to overflow:

Solving a linear system is done using \:

Despite the answer being integer-valued, here we see that it resorted to using floating point arithmetic, incurring rounding error. But it is "accurate to (roughly) 16-digits". As we shall see, the way solving a linear system works is we first write A as a product of simpler matrices, e.g., a product of triangular matrices.

Remark: (advanced) For floating point types, A \ x is implemented in LAPACK, which like BLAS is generally multi-threaded and in fact different machines may round differently.

2. Triangular matrices

Triangular matrices are represented by dense square matrices where the entries below the diagonal are ignored:

We can see that U is storing all the entries of A:

Similarly we can create a lower triangular matrix by ignoring the entries above the diagonal:

If we know a matrix is triangular we can do matrix-vector multiplication in roughly half the number of operations. Moreover, we can easily invert matrices. Consider a simple 3×3 example, which can be solved with \:

Behind the scenes, \ is doing back-substitution: considering the last row, we have all zeros apart from the last column so we know that x[3] must be equal to:

Once we know x[3], the second row states U[2,2]*x[2] + U[2,3]*x[3] == b[2], rearranging we get that x[2] must be:

Finally, the first row states U[1,1]*x[1] + U[1,2]*x[2] + U[1,3]*x[3] == b[1] i.e. x[1] is equal to

More generally, we can solve the upper-triangular system

$$\begin{bmatrix} u_{11} & \cdots & u_{1n} \\ & \ddots & \vdots \\ & & u_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}$$

by computing xn,xn1,,x1 by the back-substitution formula:

$$x_k = {b_k - \sum_{j=k+1}^n u_{kj} x_j \over u_{kk}}$$

 

The problem sheet will explore implementing this method, as well as forward substitution for inverting lower triangular matrices. The cost of multiplying and solving linear systems with a triangular matrix is O(n2).
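As a sketch, the back-substitution formula above can be implemented as follows (Python; the helper name is illustrative):

```python
# Solve U x = b for upper-triangular U by back-substitution:
# x[k] = (b[k] - sum of U[k][j]*x[j] for j > k) / U[k][k],
# computed for k = n, n-1, ..., 1.
def backsubstitution(U, b):
    n = len(b)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        r = b[k] - sum(U[k][j] * x[j] for j in range(k + 1, n))
        x[k] = r / U[k][k]
    return x

U = [[2.0, 1.0, 1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
b = [7.0, 12.0, 12.0]
print(backsubstitution(U, b))  # [1.0, 2.0, 3.0]
```

The double loop makes the $O(n^2)$ cost plain: each of the $n$ unknowns needs at most $n$ multiplications.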

3. Banded matrices

A banded matrix is zero off a prescribed number of diagonals. We call the number of (potentially) non-zero diagonals the bandwidths:

Definition (bandwidths) A matrix $A$ has lower-bandwidth $l$ if $A[k,j] = 0$ for all $k - j > l$ and upper-bandwidth $u$ if $A[k,j] = 0$ for all $j - k > u$. We say that it has strictly lower-bandwidth $l$ if it has lower-bandwidth $l$ and there exists a $j$ such that $A[j+l,j] \neq 0$. We say that it has strictly upper-bandwidth $u$ if it has upper-bandwidth $u$ and there exists a $k$ such that $A[k,k+u] \neq 0$.

Diagonal

Definition (Diagonal) Diagonal matrices are square matrices with bandwidths l=u=0.

Diagonal matrices in Julia are stored as a vector containing the diagonal entries:

It is clear that we can perform diagonal-vector multiplications and solve linear systems involving diagonal matrices efficiently (in O(n) operations).

Bidiagonal

Definition (Bidiagonal) If a square matrix has bandwidths (l,u)=(1,0) it is lower-bidiagonal and if it has bandwidths (l,u)=(0,1) it is upper-bidiagonal.

We can create Bidiagonal matrices in Julia by specifying the diagonal and off-diagonal:

Multiplication and solving linear systems with Bidiagonal systems is also O(n) operations, using the standard multiplications/back-substitution algorithms but being careful in the loops to only access the non-zero entries.

Tridiagonal

Definition (Tridiagonal) If a square matrix has bandwidths l=u=1 it is tridiagonal.

Julia has a type Tridiagonal for representing a tridiagonal matrix from its sub-diagonal, diagonal, and super-diagonal:

Tridiagonal matrices will come up in second-order differential equations and orthogonal polynomials. We will later see how linear systems involving tridiagonal matrices can be solved in O(n) operations.

4. Permutation Matrices

Permutation matrices are matrices that represent the action of permuting the entries of a vector, that is, matrix representations of the symmetric group $S_n$, acting on $\mathbb{R}^n$. Recall every $\sigma \in S_n$ is a bijection between $\{1, 2, \ldots, n\}$ and itself. We can write a permutation $\sigma$ in Cauchy notation:

$$\begin{pmatrix} 1 & 2 & 3 & \cdots & n \\ \sigma_1 & \sigma_2 & \sigma_3 & \cdots & \sigma_n \end{pmatrix}$$

where $\{\sigma_1, \ldots, \sigma_n\} = \{1, 2, \ldots, n\}$ (that is, each integer appears precisely once). We denote the inverse permutation by $\sigma^{-1}$, which can be constructed by swapping the rows of the Cauchy notation and reordering.

We can encode a permutation in vector σ=[σ1,,σn]. This induces an action on a vector (using indexing notation)

$$\mathbf{v}[\sigma] = \begin{bmatrix} v_{\sigma_1} \\ \vdots \\ v_{\sigma_n} \end{bmatrix}$$

Example (permutation of a vector) Consider the permutation σ given by

$$\begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 1 & 4 & 2 & 5 & 3 \end{pmatrix}$$

We can apply it to a vector:

Its inverse permutation $\sigma^{-1}$ has Cauchy notation coming from swapping the rows of the Cauchy notation of $\sigma$ and sorting:

$$\begin{pmatrix} 1 & 4 & 2 & 5 & 3 \\ 1 & 2 & 3 & 4 & 5 \end{pmatrix} \rightarrow \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 1 & 3 & 5 & 2 & 4 \end{pmatrix}$$

Julia has the function invperm for computing the vector that encodes the inverse permutation:

And indeed permuting the entries by σ and then by σ⁻¹ returns us to our original vector:
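A Python sketch of these permutation operations (invperm mirrors Julia's function of the same name, shifted here to 0-based indices):

```python
# Apply sigma (1-based [1,4,2,5,3], stored 0-based) to a vector,
# then undo it with the inverse permutation.
sigma = [0, 3, 1, 4, 2]

def permute(v, s):
    return [v[k] for k in s]          # v[sigma]

def invperm(s):
    inv = [0] * len(s)
    for k, sk in enumerate(s):
        inv[sk] = k                   # inv undoes s
    return inv

v = [10, 20, 30, 40, 50]
w = permute(v, sigma)
print(w)                              # [10, 40, 20, 50, 30]
print(permute(w, invperm(sigma)))     # [10, 20, 30, 40, 50]
```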

 

Note that the operator

Pσ(𝐯)=𝐯[σ]

is linear in 𝐯, therefore, we can identify it with a matrix whose action is:

$$P_\sigma \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix} = \begin{bmatrix} v_{\sigma_1} \\ \vdots \\ v_{\sigma_n} \end{bmatrix}.$$

The entries of this matrix are

$$P_\sigma[k,j] = \mathbf{e}_k^\top P_\sigma \mathbf{e}_j = \mathbf{e}_k^\top \mathbf{e}_{\sigma_j^{-1}} = \delta_{k,\sigma_j^{-1}} = \delta_{\sigma_k,j}$$

where δk,j is the Kronecker delta:

$$\delta_{k,j} := \begin{cases} 1 & k = j \\ 0 & \hbox{otherwise} \end{cases}.$$

This construction motivates the following definition:

Definition (permutation matrix) Pn×n is a permutation matrix if it is equal to the identity matrix with its rows permuted.

Example (5×5 permutation matrix) We can construct the permutation representation for σ as above as follows:

And indeed, we see its action is as expected:

Remark: (advanced) Note that P is a special type SparseMatrixCSC. This is used to represent a matrix by storing only the non-zero entries as well as their location. This is an important data type in high-performance scientific computing, but we will not be using general sparse matrices in this module.

Proposition (permutation matrix inverse) Let Pσ be a permutation matrix corresponding to the permutation σ. Then

$$P_\sigma^\top = P_{\sigma^{-1}} = P_\sigma^{-1}$$

That is, $P_\sigma$ is orthogonal:

$$P_\sigma^\top P_\sigma = P_\sigma P_\sigma^\top = I.$$

Proof

We prove orthogonality via:

$$\mathbf{e}_k^\top P_\sigma^\top P_\sigma \mathbf{e}_j = (P_\sigma \mathbf{e}_k)^\top P_\sigma \mathbf{e}_j = \mathbf{e}_{\sigma_k^{-1}}^\top \mathbf{e}_{\sigma_j^{-1}} = \delta_{k,j}$$

This shows $P_\sigma^\top P_\sigma = I$ and hence $P_\sigma^{-1} = P_\sigma^\top$.

Permutation matrices are examples of sparse matrices that can be very easily inverted.

 

4. Orthogonal matrices

Definition (orthogonal matrix) A square matrix is orthogonal if its inverse is its transpose:

$$Q^\top Q = Q Q^\top = I.$$

We have already seen examples of orthogonal matrices (permutation matrices). Here we discuss two important special cases: simple rotations and reflections.

Simple rotations

Definition (simple rotation) A 2×2 rotation matrix through angle θ is

$$Q_\theta := \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

In what follows we use the following for writing the angle of a vector:

Definition (two-arg arctan) The two-argument arctan function gives the angle θ through the point [a,b], i.e.,

$$\sqrt{a^2 + b^2} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix}$$

It can be defined in terms of the standard arctan as follows:

$${\rm atan}(b, a) := \begin{cases} {\rm atan}{b \over a} & a > 0 \\ {\rm atan}{b \over a} + \pi & a < 0 \hbox{ and } b \geq 0 \\ {\rm atan}{b \over a} - \pi & a < 0 \hbox{ and } b < 0 \\ \pi/2 & a = 0 \hbox{ and } b > 0 \\ -\pi/2 & a = 0 \hbox{ and } b < 0 \end{cases}$$

This is available in Julia:

We can rotate an arbitrary vector to the unit axis. Interestingly it only requires basic algebraic functions (no trigonometric functions):
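A sketch of this rotation in Python (note only arithmetic and a square root are needed, no trigonometric functions):

```python
# Rotate [a, b] onto the positive x-axis using
# Q = 1/sqrt(a^2+b^2) * [[a, b], [-b, a]].
import math

def rotate_to_axis(a, b):
    r = math.hypot(a, b)              # sqrt(a^2 + b^2)
    return [[a / r, b / r], [-b / r, a / r]]

a, b = 3.0, 4.0
Q = rotate_to_axis(a, b)
x = [Q[0][0] * a + Q[0][1] * b,       # apply Q to [a, b]
     Q[1][0] * a + Q[1][1] * b]
print(x)                              # ~ [5.0, 0.0]
```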

 

Proposition (rotation of a vector) The matrix

$$Q = {1 \over \sqrt{a^2 + b^2}} \begin{bmatrix} a & b \\ -b & a \end{bmatrix}$$

is a rotation matrix satisfying

$$Q \begin{bmatrix} a \\ b \end{bmatrix} = \sqrt{a^2 + b^2} \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

Proof

The last equation is trivial so the only question is that it is a rotation matrix. Define $\theta = {\rm atan}(b, a)$. By definition of the two-arg arctan we have

$$Q_{-\theta} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} = {1 \over \sqrt{a^2 + b^2}} \begin{bmatrix} a & b \\ -b & a \end{bmatrix}.$$

 

Reflections

In addition to rotations, another type of orthogonal matrix are reflections:

Definition (reflection matrix) Given a vector 𝐯 satisfying 𝐯=1, the reflection matrix is the orthogonal matrix

$$Q_{\mathbf{v}} := I - 2 \mathbf{v} \mathbf{v}^\top$$

These are reflections in the direction of 𝐯. We can show this as follows:

Proposition Q𝐯 satisfies:

  1. Symmetry: $Q_{\mathbf{v}}^\top = Q_{\mathbf{v}}$
  2. Orthogonality: $Q_{\mathbf{v}}^\top Q_{\mathbf{v}} = I$
  3. $\mathbf{v}$ is an eigenvector of $Q_{\mathbf{v}}$ with eigenvalue $-1$
  4. $Q_{\mathbf{v}}$ is a rank-1 perturbation of $I$
  5. $\det Q_{\mathbf{v}} = -1$

Proof

Property 1 follows immediately. Property 2 follows from

$$Q_{\mathbf{v}}^\top Q_{\mathbf{v}} = Q_{\mathbf{v}}^2 = I - 4 \mathbf{v} \mathbf{v}^\top + 4 \mathbf{v} \mathbf{v}^\top \mathbf{v} \mathbf{v}^\top = I$$

Property 3 follows since

$$Q_{\mathbf{v}} \mathbf{v} = -\mathbf{v}$$

Property 4 follows since $\mathbf{v} \mathbf{v}^\top$ is a rank-1 matrix as all rows are linear combinations of each other. To see property 5, note there is a dimension $n-1$ space $W$ orthogonal to $\mathbf{v}$, that is, for all $\mathbf{w} \in W$ we have $\mathbf{w}^\top \mathbf{v} = 0$, which implies that

$$Q_{\mathbf{v}} \mathbf{w} = \mathbf{w}$$

In other words, $1$ is an eigenvalue with multiplicity $n-1$ and $-1$ is an eigenvalue with multiplicity 1, and thus the product of the eigenvalues is $-1$.

 

Example (reflection through 2-vector) Consider reflection through 𝐱=[1,2]. We first need to normalise 𝐱:

$$\mathbf{v} = {\mathbf{x} \over \|\mathbf{x}\|} = \begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}$$

Note this indeed has unit norm:

$$\|\mathbf{v}\|^2 = {1 \over 5} + {4 \over 5} = 1.$$

Thus the reflection matrix is:

$$Q_{\mathbf{v}} = I - 2 \mathbf{v} \mathbf{v}^\top = \begin{bmatrix} 1 & \\ & 1 \end{bmatrix} - {2 \over 5} \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} = {1 \over 5} \begin{bmatrix} 3 & -4 \\ -4 & -3 \end{bmatrix}$$

Indeed it is symmetric, and orthogonal. It sends $\mathbf{x}$ to $-\mathbf{x}$:

$$Q_{\mathbf{v}} \mathbf{x} = {1 \over 5} \begin{bmatrix} 3 - 8 \\ -4 - 6 \end{bmatrix} = -\mathbf{x}$$

Any vector orthogonal to $\mathbf{x}$, like $\mathbf{y} = [2, -1]$, is left fixed:

$$Q_{\mathbf{v}} \mathbf{y} = {1 \over 5} \begin{bmatrix} 6 + 4 \\ -8 + 3 \end{bmatrix} = \mathbf{y}$$

Note that building the matrix Q𝐯 will be expensive (O(n2) operations), but we can apply Q𝐯 to a vector in O(n) operations using the expression:

$$Q_{\mathbf{v}} \mathbf{x} = \mathbf{x} - 2 \mathbf{v} (\mathbf{v}^\top \mathbf{x}).$$

Just as rotations can be used to rotate vectors to be aligned with coordinate axis, so can reflections, but in this case it works for vectors in $\mathbb{R}^n$, not just $\mathbb{R}^2$:

Definition (Householder reflection) For a given vector 𝐱, define the Householder reflection

$$Q_{\mathbf{x}}^{\pm,{\rm H}} := Q_{\mathbf{w}}$$

for $\mathbf{y} = \mp \|\mathbf{x}\| \mathbf{e}_1 + \mathbf{x}$ and $\mathbf{w} = {\mathbf{y} \over \|\mathbf{y}\|}$. The default choice in sign is:

$$Q_{\mathbf{x}}^{\rm H} := Q_{\mathbf{x}}^{-{\rm sign}(x_1),{\rm H}}.$$

Lemma (Householder reflection)

$$Q_{\mathbf{x}}^{\pm,{\rm H}} \mathbf{x} = \pm \|\mathbf{x}\| \mathbf{e}_1$$

Proof Note that

$$\|\mathbf{y}\|^2 = 2\|\mathbf{x}\|^2 \mp 2\|\mathbf{x}\| x_1, \qquad \mathbf{y}^\top \mathbf{x} = \|\mathbf{x}\|^2 \mp \|\mathbf{x}\| x_1$$

where $x_1 = \mathbf{e}_1^\top \mathbf{x}$. Therefore:

$$Q_{\mathbf{x}}^{\pm,{\rm H}} \mathbf{x} = (I - 2 \mathbf{w} \mathbf{w}^\top) \mathbf{x} = \mathbf{x} - 2 {\mathbf{y}^\top \mathbf{x} \over \|\mathbf{y}\|^2} \mathbf{y} = \mathbf{x} - \mathbf{y} = \pm \|\mathbf{x}\| \mathbf{e}_1.$$

Why do we choose the opposite sign of $x_1$ for the default reflection? For stability. We demonstrate the reason for this by numerical example. Consider $\mathbf{x} = [1, h]$, i.e., a small perturbation from $\mathbf{e}_1$. If we reflect to ${\rm norm}(\mathbf{x}) \mathbf{e}_1$ we see a numerical problem:

It didn't work! Even worse is if h = 0:

This is because y has large relative error due to cancellation from floating point errors in computing the first entry x[1] - norm(x). (Or has norm zero if h=0.) We avoid this cancellation by using the default choice:
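The effect of the sign choice can be sketched as follows (Python; householder_apply is an illustrative helper, and the sign argument selects which of the two reflections is used):

```python
# Householder reflection of x onto (sign)*||x|| e1. Choosing the sign
# opposite to x[0] avoids cancellation when x is nearly parallel to e1.
import math

def householder_apply(x, sign):
    nrm = math.sqrt(sum(t * t for t in x))
    y = list(x)
    y[0] -= sign * nrm                # y = x - (sign)*||x|| e1
    ynrm = math.sqrt(sum(t * t for t in y))
    w = [t / ynrm for t in y]
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return [xi - 2 * dot * wi for xi, wi in zip(x, w)]

h = 1e-10
x = [1.0, h]
good = householder_apply(x, -1.0)   # reflect to -||x|| e1: stable
bad = householder_apply(x, +1.0)    # reflect to +||x|| e1: cancellation
print(good[1], bad[1])
```

The second entry of good is zero to round-off, while in bad the cancellation in computing x[0] - norm(x) leaves an error on the order of h itself.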


Week 4: Decompositions and least squares

YouTube Lectures: Least squares and QR Gram-Schmidt and Reduced QR Householder and QR PLU Decomposition Cholesky Decomposition

We now look at several decompositions (or factorisations) of a matrix into products of structured matrices, and their use in solving least squares problems and linear systems. For a square or rectangular matrix Am×n with more rows than columns (mn), we consider:

  1. The QR decomposition
$$A = QR = \underbrace{\begin{bmatrix} \mathbf{q}_1 | \cdots | \mathbf{q}_m \end{bmatrix}}_{m \times m} \underbrace{\begin{bmatrix} \times & \cdots & \times \\ & \ddots & \vdots \\ & & \times \\ 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix}}_{m \times n}$$

where $Q$ is orthogonal ($Q^\top Q = I$, $\mathbf{q}_j \in \mathbb{R}^m$) and $R$ is right triangular, which means it is only nonzero on or to the right of the diagonal.

  2. The reduced QR decomposition
$$A = \hat{Q} \hat{R} = \underbrace{\begin{bmatrix} \mathbf{q}_1 | \cdots | \mathbf{q}_n \end{bmatrix}}_{m \times n} \underbrace{\begin{bmatrix} \times & \cdots & \times \\ & \ddots & \vdots \\ & & \times \end{bmatrix}}_{n \times n}$$

where $\hat{Q}$ has orthonormal columns ($\hat{Q}^\top \hat{Q} = I$, $\mathbf{q}_j \in \mathbb{R}^m$) and $\hat{R}$ is upper triangular.

For a square matrix we consider the PLU decomposition:

A=PLU

where P is a permutation matrix, L is lower triangular and U is upper triangular.

Finally, for a square, symmetric positive definite ($\mathbf{x}^\top A \mathbf{x} > 0$ for all $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x} \neq 0$) matrix we consider the Cholesky decomposition:

$$A = L L^\top$$

The importance of these decomposition for square matrices is that their component pieces are easy to invert on a computer:

$$\begin{aligned}
A = PLU &\Rightarrow A^{-1} \mathbf{b} = U^{-1} L^{-1} P \mathbf{b} \\
A = QR &\Rightarrow A^{-1} \mathbf{b} = R^{-1} Q^\top \mathbf{b} \\
A = L L^\top &\Rightarrow A^{-1} \mathbf{b} = L^{-\top} L^{-1} \mathbf{b}
\end{aligned}$$

and we saw last lecture that triangular and orthogonal matrices are easy to invert when applied to a vector 𝐛, e.g., using forward/back-substitution. For rectangular matrices we will see that they lead to efficient solutions to the least squares problem: find 𝐱 that minimizes the 2-norm

A𝐱𝐛.

In this lecture we discuss the following:

  1. QR and least squares: We discuss the QR decomposition and its usage in solving least squares problems.
  2. Reduced QR and Gram–Schmidt: We discuss computation of the Reduced QR decomposition using Gram–Schmidt.
  3. Householder reflections and QR: We discuss computing the QR decomposition using Householder reflections.
  4. PLU decomposition: we discuss how the LU decomposition can be computed using Gaussian elimination, and the computation of the PLU decomposition via Gaussian elimination with pivoting.
  5. Cholesky decomposition: we introduce symmetric positive definite matrices and show that their LU decomposition can be re-interpreted as a Cholesky decomposition.
  6. Timings: we see the relative trade-off in speed between the different decompositions.

 

1. QR and least squares

Here we consider rectangular matrices with more rows than columns. A QR decomposition decomposes a matrix into an orthogonal matrix Q times a right triangular matrix R. Note the QR decomposition contains within it the reduced QR decomposition:

$$A = QR = \begin{bmatrix} \hat{Q} \,|\, \mathbf{q}_{n+1} | \cdots | \mathbf{q}_m \end{bmatrix} \begin{bmatrix} \hat{R} \\ \mathbf{0}_{(m-n) \times n} \end{bmatrix} = \hat{Q} \hat{R}.$$

We can use it to solve a least squares problem using the norm-preserving property (see PS3) of orthogonal matrices:

$$\|A \mathbf{x} - \mathbf{b}\| = \|QR \mathbf{x} - \mathbf{b}\| = \|R \mathbf{x} - Q^\top \mathbf{b}\| = \left\| \begin{bmatrix} \hat{R} \\ \mathbf{0}_{(m-n) \times n} \end{bmatrix} \mathbf{x} - \begin{bmatrix} \hat{Q}^\top \\ \mathbf{q}_{n+1}^\top \\ \vdots \\ \mathbf{q}_m^\top \end{bmatrix} \mathbf{b} \right\|$$

Now note that the rows k>n are independent of 𝐱 and are a fixed contribution. Thus to minimise this norm it suffices to drop them and minimise:

$$\|\hat{R} \mathbf{x} - \hat{Q}^\top \mathbf{b}\|$$

This norm is minimised when it is zero, and that minimum is attainable: provided the column rank of $A$ is full, $\hat{R}$ will be invertible (Exercise: why is this?). Thus we have the solution

$$\mathbf{x} = \hat{R}^{-1} \hat{Q}^\top \mathbf{b}$$

Example (quadratic fit) Suppose we want to fit noisy data by a quadratic

$$p(x) = p_0 + p_1 x + p_2 x^2$$

That is, we want to choose $p_0, p_1, p_2$ at data samples $x_1, \ldots, x_m$ so that the following is true:

$$p_0 + p_1 x_k + p_2 x_k^2 \approx f_k$$

where $f_k$ are given by data. We can reinterpret this as a least squares problem: minimise the norm

$$\left\| \begin{bmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ p_2 \end{bmatrix} - \begin{bmatrix} f_1 \\ \vdots \\ f_m \end{bmatrix} \right\|$$

We can solve this using the QR decomposition:

We can visualise the fit:

Note that \ with a rectangular system does least squares by default:
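A sketch of the same least squares fit in Python (numpy's lstsq plays the role of \ here; the data is illustrative noiseless samples, so the recovered coefficients are the exact ones):

```python
# Fit p0 + p1*x + p2*x^2 to samples by least squares.
import numpy as np

x = np.linspace(0.0, 1.0, 20)
f = 1.0 + 2.0 * x + 3.0 * x**2           # samples of a known quadratic
V = np.column_stack([x**0, x, x**2])     # the m-by-3 design matrix
p, *_ = np.linalg.lstsq(V, f, rcond=None)
print(p)                                  # ~ [1.0, 2.0, 3.0]
```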

 

 

2. Reduced QR and Gram–Schmidt

How do we compute the QR decomposition? We begin with a method you may have seen before in another guise. Write

A=[𝐚1||𝐚n]

where 𝐚km and assume they are linearly independent (A has full column rank). Note that the column span of the first j columns of A will be the same as the first j columns of Q̂, as R̂ must be non-singular:

span(𝐚1,,𝐚j)=span(𝐪1,,𝐪j)

In other words: the columns of $\hat{Q}$ are an orthogonal basis of the column span of $A$. To see this note that since $\hat{R}$ is triangular we have

$$\begin{bmatrix} \mathbf{a}_1 | \cdots | \mathbf{a}_j \end{bmatrix} = \begin{bmatrix} \mathbf{q}_1 | \cdots | \mathbf{q}_j \end{bmatrix} \hat{R}[1{:}j, 1{:}j]$$

for all j. That is, if 𝐯span(𝐚1,,𝐚j) we have for 𝐜j

𝐯=[𝐚1||𝐚j]𝐜=[𝐪1||𝐪j]R̂[1:j,1:j]𝐜span(𝐪1,,𝐪j)

while if 𝐰span(𝐪1,,𝐪j) we have for 𝐝j

𝐰=[𝐪1||𝐪j]𝐝=[𝐚1||𝐚j]R̂[1:j,1:j]1𝐝span(𝐚1,,𝐚j).

It is possible to find an orthogonal basis using the Gram–Schmidt algorithm. We construct it via induction: assume that

span(𝐚1,,𝐚j1)=span(𝐪1,,𝐪j1)

where 𝐪1,,𝐪j1 are orthogonal:

$$\mathbf{q}_k^\top \mathbf{q}_\ell = \delta_{k\ell} = \begin{cases} 1 & k = \ell \\ 0 & \hbox{otherwise} \end{cases}$$

for $k, \ell < j$. Define

$$\mathbf{v}_j := \mathbf{a}_j - \sum_{k=1}^{j-1} \underbrace{\mathbf{q}_k^\top \mathbf{a}_j}_{r_{kj}} \mathbf{q}_k$$

so that for k<j

$$\mathbf{q}_k^\top \mathbf{v}_j = \mathbf{q}_k^\top \mathbf{a}_j - \sum_{\ell=1}^{j-1} (\mathbf{q}_\ell^\top \mathbf{a}_j) \mathbf{q}_k^\top \mathbf{q}_\ell = 0.$$

Then we define

$$\mathbf{q}_j := {\mathbf{v}_j \over \|\mathbf{v}_j\|},$$

which satisfies $\mathbf{q}_k^\top \mathbf{q}_j = \delta_{kj}$ for $k \leq j$.

We now reinterpret this construction as a reduced QR decomposition. Define $r_{jj} := \|\mathbf{v}_j\|$. Then rearranging the definition we have

$$\mathbf{a}_j = \begin{bmatrix} \mathbf{q}_1 | \cdots | \mathbf{q}_j \end{bmatrix} \begin{bmatrix} r_{1j} \\ \vdots \\ r_{jj} \end{bmatrix}$$

Thus

$$\begin{bmatrix} \mathbf{a}_1 | \cdots | \mathbf{a}_j \end{bmatrix} = \begin{bmatrix} \mathbf{q}_1 | \cdots | \mathbf{q}_j \end{bmatrix} \begin{bmatrix} r_{11} & \cdots & r_{1j} \\ & \ddots & \vdots \\ & & r_{jj} \end{bmatrix}$$

That is, we are computing the reduced QR decomposition column-by-column. Running this algorithm to j=n completes the decomposition.

Gram–Schmidt in action

We are going to compute the reduced QR of a random matrix

The first column of is indeed a normalised first column of A:

We now determine the next entries as

And the third column is then:

(Note the signs may not necessarily match.)

We can clean this up as a simple algorithm:
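A sketch of the cleaned-up algorithm (Python; the function name gram_schmidt is illustrative, and Q is stored as a list of columns):

```python
# Reduced QR via classical Gram-Schmidt: orthonormalise the columns
# of A, recording the coefficients r_kj = q_k . a_j in R.
def gram_schmidt(A):
    m, n = len(A), len(A[0])
    cols = [[A[k][j] for k in range(m)] for j in range(n)]
    Q = []                                  # columns q_1, ..., q_n
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(Q[k][i] * cols[j][i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        R[j][j] = sum(t * t for t in v) ** 0.5
        Q.append([t / R[j][j] for t in v])
    return Q, R

A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
Q, R = gram_schmidt(A)
print(sum(s * t for s, t in zip(Q[0], Q[1])))  # ~ 0.0: columns orthogonal
```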

Complexity and stability

We see within the for j = 1:n loop that we have O(mj) operations. Thus the total complexity is O(mn2) operations.

Unfortunately, the Gram–Schmidt algorithm is unstable: the rounding errors when implemented in floating point accumulate in a way that we lose orthogonality:

3. Householder reflections and QR

As an alternative, we will consider using Householder reflections to introduce zeros below the diagonal. Thus, if Gram–Schmidt is a process of triangular orthogonalisation (using triangular matrices to orthogonalise), Householder reflections is a process of orthogonal triangularisation (using orthogonal matrices to triangularise).

Consider multiplication by the Householder reflection corresponding to the first column, that is, for

Q1:=Q𝐚1H,

consider

$$Q_1 A = \begin{bmatrix} \times & \times & \cdots & \times \\ & \times & \cdots & \times \\ & \vdots & & \vdots \\ & \times & \cdots & \times \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ & \mathbf{a}_2^1 & \cdots & \mathbf{a}_n^1 \end{bmatrix}$$

where

$$r_{1j} := (Q_1 \mathbf{a}_j)[1] \qquad \hbox{and} \qquad \mathbf{a}_j^1 := (Q_1 \mathbf{a}_j)[2{:}m],$$

noting $r_{11} = -{\rm sign}(a_{11}) \|\mathbf{a}_1\|$ and all entries of $\mathbf{a}_1^1$ are zero (thus not included). That is, we have made the first column triangular.

But now consider

$$Q_2 := \begin{bmatrix} 1 & \\ & Q_{\mathbf{a}_2^1}^{\rm H} \end{bmatrix}$$

so that

$$Q_2 Q_1 A = \begin{bmatrix} \times & \times & \times & \cdots & \times \\ & \times & \times & \cdots & \times \\ & & \times & \cdots & \times \\ & & \vdots & & \vdots \\ & & \times & \cdots & \times \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\ & r_{22} & r_{23} & \cdots & r_{2n} \\ & & \mathbf{a}_3^2 & \cdots & \mathbf{a}_n^2 \end{bmatrix}$$

where

$$r_{2j} := (Q_2 \mathbf{a}_j^1)[1] \qquad \hbox{and} \qquad \mathbf{a}_j^2 := (Q_2 \mathbf{a}_j^1)[2{:}m-1]$$

Thus the first two columns are triangular.

The inductive construction is thus clear. If we define 𝐚j0:=𝐚j we have the construction

$$Q_j := \begin{bmatrix} I_{(j-1) \times (j-1)} & \\ & Q_{\mathbf{a}_j^{j-1}}^{\pm,{\rm H}} \end{bmatrix}, \qquad \mathbf{a}_j^k := (Q_k \mathbf{a}_j^{k-1})[2{:}m-k+1], \qquad r_{kj} := (Q_k \mathbf{a}_j^{k-1})[1]$$

Then

$$Q_n \cdots Q_1 A = \underbrace{\begin{bmatrix} r_{11} & \cdots & r_{1n} \\ & \ddots & \vdots \\ & & r_{nn} \\ 0 & \cdots & 0 \\ \vdots & & \vdots \end{bmatrix}}_{R}$$

i.e.

$$A = \underbrace{Q_1 \cdots Q_n}_{Q} R.$$

The implementation is cleaner. We do a naive implementation here:
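A naive sketch in Python, forming each Householder reflection as a full dense matrix (illustrative helper names; a small test matrix is used):

```python
# QR by Householder reflections: at step j build the full matrix
# Q_j = diag(I, I - 2 w w^T) and apply it, so R = Q_n ... Q_1 A.
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def householder_qr(A):
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for j in range(n):
        x = [R[k][j] for k in range(j, m)]         # a_j^{j-1}
        nrm = math.sqrt(sum(t * t for t in x))
        sign = 1.0 if x[0] >= 0 else -1.0
        x[0] += sign * nrm                          # y = sign(x1)||x|| e1 + x
        ynrm = math.sqrt(sum(t * t for t in x))
        w = [t / ynrm for t in x]
        Qj = [[float(i == k) for k in range(m)] for i in range(m)]
        for i in range(j, m):
            for k in range(j, m):
                Qj[i][k] -= 2 * w[i - j] * w[k - j]
        R = matmul(Qj, R)
        Q = matmul(Q, Qj)                           # Q = Q_1 Q_2 ...
    return Q, R

A = [[1.0, 2.0, -1.0], [0.0, 15.0, 18.0], [2.0, 4.0, 4.0], [2.0, 4.0, 10.0]]
Q, R = householder_qr(A)
print(round(R[0][0], 8))   # -3.0, i.e. -sign(a11)*||a1||
```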

Note because we are forming a full matrix representation of each Householder reflection this is a slow algorithm, taking O(n4) operations. The problem sheet will consider a better implementation that takes O(n3) operations.

Example We will now do an example by hand. Consider the $4 \times 3$ matrix

$$A = \begin{bmatrix} 1 & 2 & -1 \\ 0 & 15 & 18 \\ 2 & 4 & 4 \\ 2 & 4 & 10 \end{bmatrix}$$

For the first column we have

$$Q_1 = I - {1 \over 12} \begin{bmatrix} 4 \\ 0 \\ 2 \\ 2 \end{bmatrix} \begin{bmatrix} 4 & 0 & 2 & 2 \end{bmatrix} = {1 \over 3} \begin{bmatrix} -1 & 0 & -2 & -2 \\ 0 & 3 & 0 & 0 \\ -2 & 0 & 2 & -1 \\ -2 & 0 & -1 & 2 \end{bmatrix}$$

so that

$$Q_1 A = \begin{bmatrix} -3 & -6 & -9 \\ 0 & 15 & 18 \\ 0 & 0 & 0 \\ 0 & 0 & 6 \end{bmatrix}$$

In this example the next column is already upper-triangular, but because of our choice of reflection we will end up swapping the sign, that is

$$Q_2 = \begin{bmatrix} 1 & & & \\ & -1 & & \\ & & 1 & \\ & & & 1 \end{bmatrix}$$

so that

$$Q_2 Q_1 A = \begin{bmatrix} -3 & -6 & -9 \\ 0 & -15 & -18 \\ 0 & 0 & 0 \\ 0 & 0 & 6 \end{bmatrix}$$

The final reflection is

$$Q_3 = \begin{bmatrix} I_{2 \times 2} & \\ & I - \begin{bmatrix} 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 0 & -1 \\ & & -1 & 0 \end{bmatrix}$$

giving us

$$Q_3 Q_2 Q_1 A = \underbrace{\begin{bmatrix} -3 & -6 & -9 \\ 0 & -15 & -18 \\ 0 & 0 & -6 \\ 0 & 0 & 0 \end{bmatrix}}_{R}$$

That is,

$$A = Q_1 Q_2 Q_3 R = \underbrace{{1 \over 3} \begin{bmatrix} -1 & 0 & 2 & 2 \\ 0 & -3 & 0 & 0 \\ -2 & 0 & 1 & -2 \\ -2 & 0 & -2 & 1 \end{bmatrix}}_{Q} \underbrace{\begin{bmatrix} -3 & -6 & -9 \\ 0 & -15 & -18 \\ 0 & 0 & -6 \\ 0 & 0 & 0 \end{bmatrix}}_{R} = \underbrace{{1 \over 3} \begin{bmatrix} -1 & 0 & 2 \\ 0 & -3 & 0 \\ -2 & 0 & 1 \\ -2 & 0 & -2 \end{bmatrix}}_{\hat{Q}} \underbrace{\begin{bmatrix} -3 & -6 & -9 \\ 0 & -15 & -18 \\ 0 & 0 & -6 \end{bmatrix}}_{\hat{R}}$$

 

2. PLU Decomposition

Just as Gram–Schmidt can be reinterpreted as a reduced QR decomposition, Gaussian elimination with pivoting can be interpreted as a PLU decomposition.

Special "one-column" lower triangular matrices

Consider the following set of n×n lower triangular matrices which equals identity apart from one-column:

$$\mathcal{L}_j := \left\{ I + \begin{bmatrix} \mathbf{0}_j \\ \mathbf{l}_j \end{bmatrix} \mathbf{e}_j^\top : \mathbf{l}_j \in \mathbb{R}^{n-j} \right\}$$

where $\mathbf{0}_j$ denotes the zero vector of length $j$. That is, if $L_j \in \mathcal{L}_j$ then it is equal to the identity matrix apart from in the $j$th column:

$$L_j = \begin{bmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & \ell_{j+1,j} & 1 & \\ & & \vdots & & \ddots & \\ & & \ell_{n,j} & & & 1 \end{bmatrix}$$

These satisfy the following special properties:

Proposition (one-column lower triangular inverse) If LjLj then

$$L_j^{-1} = I - \begin{bmatrix} \mathbf{0}_j \\ \mathbf{l}_j \end{bmatrix} \mathbf{e}_j^\top = \begin{bmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & -\ell_{j+1,j} & 1 & \\ & & \vdots & & \ddots & \\ & & -\ell_{n,j} & & & 1 \end{bmatrix} \in \mathcal{L}_j.$$

Proposition (one-column lower triangular multiplication) If LjLj and LkLk for kj then

$$L_j L_k = I + \begin{bmatrix} \mathbf{0}_j \\ \mathbf{l}_j \end{bmatrix} \mathbf{e}_j^\top + \begin{bmatrix} \mathbf{0}_k \\ \mathbf{l}_k \end{bmatrix} \mathbf{e}_k^\top$$

Lemma (one-column lower triangular with pivoting) If $\sigma$ is a permutation that leaves the first $j$ rows fixed (that is, $\sigma_\ell = \ell$ for $\ell \leq j$) and $L_j \in \mathcal{L}_j$ then

PσLj=L~jPσ

where L~jLj.

Proof Write

$$P_\sigma = \begin{bmatrix} I_j & \\ & P_\tau \end{bmatrix}$$

where $\tau$ is the permutation with Cauchy notation

$$\begin{pmatrix} 1 & \cdots & n-j \\ \sigma_{j+1} - j & \cdots & \sigma_n - j \end{pmatrix}$$

Then we have

$$P_\sigma L_j = P_\sigma + \begin{bmatrix} \mathbf{0}_j \\ P_\tau \mathbf{l}_j \end{bmatrix} \mathbf{e}_j^\top = \underbrace{\left( I + \begin{bmatrix} \mathbf{0}_j \\ P_\tau \mathbf{l}_j \end{bmatrix} \mathbf{e}_j^\top \right)}_{\tilde{L}_j} P_\sigma$$

noting that $\mathbf{e}_j^\top P_\sigma = \mathbf{e}_j^\top$ (as $\sigma_j = j$).

LU Decomposition

Before discussing pivoting, consider standard Gaussian elimination where one row-reduces to introduce zeros column-by-column. We will mimic the computation of the QR decomposition to view this as a triangular triangularisation.

Consider the matrix

$$L_1 = \begin{bmatrix} 1 & & & \\ -{a_{21} \over a_{11}} & 1 & & \\ \vdots & & \ddots & \\ -{a_{n1} \over a_{11}} & & & 1 \end{bmatrix} = I - \begin{bmatrix} 0 \\ {\mathbf{a}_1[2{:}n] \over \mathbf{a}_1[1]} \end{bmatrix} \mathbf{e}_1^\top.$$

We then have

$$L_1 A = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ & \mathbf{a}_2^1 & \cdots & \mathbf{a}_n^1 \end{bmatrix}$$

where 𝐚j1:=(L1𝐚j)[2:n] and u1j=a1j. But now consider

$$L_2 := I - \begin{bmatrix} 0 \\ {\mathbf{a}_2^1[2{:}n-1] \over \mathbf{a}_2^1[1]} \end{bmatrix} \mathbf{e}_1^\top.$$

Then

$$L_2 L_1 A = \begin{bmatrix} u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\ & u_{22} & u_{23} & \cdots & u_{2n} \\ & & \mathbf{a}_3^2 & \cdots & \mathbf{a}_n^2 \end{bmatrix}$$

where

u2j:=(𝐚j1)[1]and𝐚j2:=(L2𝐚j1)[2:n1]

Thus the first two columns are triangular.

The inductive construction is again clear. If we define 𝐚j0:=𝐚j we have the construction

$$L_j := I - \begin{bmatrix} \mathbf{0}_j \\ {\mathbf{a}_{j+1}^j[2{:}n-j] \over \mathbf{a}_{j+1}^j[1]} \end{bmatrix} \mathbf{e}_j^\top, \qquad \mathbf{a}_j^k := (L_k \mathbf{a}_j^{k-1})[2{:}n-k+1], \qquad u_{kj} := (L_k \mathbf{a}_j^{k-1})[1]$$

Then

$$L_{n-1} \cdots L_1 A = \underbrace{\begin{bmatrix} u_{11} & \cdots & u_{1n} \\ & \ddots & \vdots \\ & & u_{nn} \end{bmatrix}}_{U}$$

i.e.

$$A = \underbrace{L_1^{-1} \cdots L_{n-1}^{-1}}_{L} U.$$

Writing

$$L_j = I - \begin{bmatrix} \mathbf{0}_j \\ \ell_{j+1,j} \\ \vdots \\ \ell_{n,j} \end{bmatrix} \mathbf{e}_j^\top$$

and using the properties of inversion and multiplication we therefore deduce

$$L = \begin{bmatrix} 1 & & & & \\ \ell_{21} & 1 & & & \\ \ell_{31} & \ell_{32} & 1 & & \\ \vdots & \vdots & \ddots & \ddots & \\ \ell_{n1} & \ell_{n2} & \cdots & \ell_{n,n-1} & 1 \end{bmatrix}$$

 

Example (computer) We will do a numerical example (by-hand is equivalent to Gaussian elimination). The first lower triangular matrix is:

Which indeed introduces zeros in the first column:

Now we make the next lower triangular operator:

So that

The last one is:

Giving us

and

Note how the entries of L are indeed identical to the negative lower entries of L₁, L₂ and L₃.

Example (by-hand)

Consider the matrix

$$A = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 4 & 8 \\ 1 & 4 & 9 \end{bmatrix}$$

We have

$$L_2 L_1 A = L_2 \begin{bmatrix} 1 & & \\ -2 & 1 & \\ -1 & & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 2 & 4 & 8 \\ 1 & 4 & 9 \end{bmatrix} = \begin{bmatrix} 1 & & \\ & 1 & \\ & -{3 \over 2} & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 0 & 2 & 6 \\ 0 & 3 & 8 \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 & 1 \\ 0 & 2 & 6 \\ 0 & 0 & -1 \end{bmatrix}}_{U}$$

We then deduce $L$ by taking the negative of the lower entries of $L_1, L_2$:

$$L = \begin{bmatrix} 1 & & \\ 2 & 1 & \\ 1 & {3 \over 2} & 1 \end{bmatrix}.$$
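This elimination can be sketched in code (Python, no pivoting; names are illustrative). Run on the matrix above, it reproduces the same $L$ and $U$:

```python
# LU without pivoting: eliminate column j, storing the multipliers
# l_kj = a_kj / a_jj in L and the row-reduced matrix in U.
def lu(A):
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for j in range(n - 1):
        for k in range(j + 1, n):
            L[k][j] = U[k][j] / U[j][j]        # multiplier l_kj
            for i in range(j, n):
                U[k][i] -= L[k][j] * U[j][i]   # row_k -= l_kj * row_j
    return L, U

A = [[1.0, 1.0, 1.0], [2.0, 4.0, 8.0], [1.0, 4.0, 9.0]]
L, U = lu(A)
print(L)  # [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [1.0, 1.5, 1.0]]
print(U)  # [[1.0, 1.0, 1.0], [0.0, 2.0, 6.0], [0.0, 0.0, -1.0]]
```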

PLU Decomposition

We learned in first year linear algebra that if a diagonal entry is zero when doing Gaussian elimination one has to row pivot. For stability, in implementation one always pivots: swap the largest in magnitude entry for the entry on the diagonal. In terms of a decomposition, this leads to

$$L_{n-1} P_{n-1} \cdots P_2 L_1 P_1 A = U$$

where $P_j$ is a permutation that leaves rows 1 through $j-1$ fixed, and swaps row $j$ with a row $k \geq j$ whose entry is maximal in absolute value.

Thus we can deduce that

$$L_{n-1} P_{n-1} \cdots P_2 L_1 P_1 = L_{n-1} \tilde{L}_{n-2} \cdots \tilde{L}_1 \underbrace{P_{n-1} \cdots P_2 P_1}_{P}.$$

where the tilde denotes the combined actions of swapping the permutation and lower-triangular matrices, that is,

$$P_{n-1} \cdots P_{j+1} L_j = \tilde{L}_j P_{n-1} \cdots P_{j+1}.$$

where $\tilde{L}_j \in \mathcal{L}_j$. The entries of $L$ are then again the negative of the entries below the diagonal of $L_{n-1}, \tilde{L}_{n-2}, \ldots, \tilde{L}_1$.

Writing

$$\tilde{L}_j = I - \begin{bmatrix} \mathbf{0}_j \\ \tilde{\ell}_{j+1,j} \\ \vdots \\ \tilde{\ell}_{n,j} \end{bmatrix} \mathbf{e}_j^\top$$

and using the properties of inversion and multiplication we therefore deduce

$$L = \begin{bmatrix} 1 & & & & \\ \tilde{\ell}_{21} & 1 & & & \\ \tilde{\ell}_{31} & \tilde{\ell}_{32} & 1 & & \\ \vdots & \vdots & \ddots & \ddots & \\ \tilde{\ell}_{n1} & \tilde{\ell}_{n2} & \cdots & \tilde{\ell}_{n,n-1} & 1 \end{bmatrix}$$

 

Example

Again we consider the matrix

A=[111248149]

Even though $a_{11} = 1 \neq 0$, we still pivot: placing the maximum entry on the diagonal to mitigate numerical errors. That is, we first pivot and upper triangularise the first column:

$$L_1 P_1 A = L_1 \begin{bmatrix} 0 & 1 & \\ 1 & 0 & \\ & & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 2 & 4 & 8 \\ 1 & 4 & 9 \end{bmatrix} = \begin{bmatrix} 1 & & \\ -{1 \over 2} & 1 & \\ -{1 \over 2} & & 1 \end{bmatrix} \begin{bmatrix} 2 & 4 & 8 \\ 1 & 1 & 1 \\ 1 & 4 & 9 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 8 \\ 0 & -1 & -3 \\ 0 & 2 & 5 \end{bmatrix}$$

We now pivot and upper triangularise the second column:

$$L_2 P_2 L_1 P_1 A = L_2 \begin{bmatrix} 1 & & \\ & 0 & 1 \\ & 1 & 0 \end{bmatrix} \begin{bmatrix} 2 & 4 & 8 \\ 0 & -1 & -3 \\ 0 & 2 & 5 \end{bmatrix} = \begin{bmatrix} 1 & & \\ & 1 & \\ & {1 \over 2} & 1 \end{bmatrix} \begin{bmatrix} 2 & 4 & 8 \\ 0 & 2 & 5 \\ 0 & -1 & -3 \end{bmatrix} = \underbrace{\begin{bmatrix} 2 & 4 & 8 \\ 0 & 2 & 5 \\ 0 & 0 & -{1 \over 2} \end{bmatrix}}_{U}$$

We now move P2 to the right:

$$L_2 P_2 L_1 P_1 = \underbrace{\begin{bmatrix} 1 & & \\ -{1 \over 2} & 1 & \\ -{1 \over 2} & {1 \over 2} & 1 \end{bmatrix}}_{L_2 \tilde{L}_1} \underbrace{\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}}_{P}$$

That is

$$L = \begin{bmatrix} 1 & & \\ {1 \over 2} & 1 & \\ {1 \over 2} & -{1 \over 2} & 1 \end{bmatrix}$$

We see how this example is done on a computer:
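A sketch of the pivoted elimination (Python; the permutation is tracked as a row ordering p rather than a matrix, and names are illustrative):

```python
# LU with partial pivoting: swap the row with the largest pivot into
# place, tracking the permutation as a row ordering p, so A[p] = L U.
def plu(A):
    n = len(A)
    U = [row[:] for row in A]
    L = [[0.0] * n for _ in range(n)]
    p = list(range(n))
    for j in range(n - 1):
        s = max(range(j, n), key=lambda k: abs(U[k][j]))  # pivot row
        U[j], U[s] = U[s], U[j]
        L[j], L[s] = L[s], L[j]
        p[j], p[s] = p[s], p[j]
        for k in range(j + 1, n):
            L[k][j] = U[k][j] / U[j][j]
            for i in range(j, n):
                U[k][i] -= L[k][j] * U[j][i]
    for j in range(n):
        L[j][j] = 1.0
    return p, L, U

A = [[1.0, 1.0, 1.0], [2.0, 4.0, 8.0], [1.0, 4.0, 9.0]]
p, L, U = plu(A)
print(p)   # [1, 2, 0]
print(U)   # [[2.0, 4.0, 8.0], [0.0, 2.0, 5.0], [0.0, 0.0, -0.5]]
```

Run on the example above, this reproduces the permutation, $L$ and $U$ derived by hand.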

The permutation is

Thus to invert a system we can do:

Note the entries match exactly because this precisely what \ is using.

3. Cholesky Decomposition

Cholesky Decomposition is a form of Gaussian elimination (without pivoting) that exploits symmetry in the problem, resulting in a substantial speedup. It is only relevant for symmetric positive definite matrices.

Definition (positive definite) A square matrix $A \in \mathbb{R}^{n \times n}$ is positive definite if for all $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{x} \neq 0$ we have

$$\mathbf{x}^\top A \mathbf{x} > 0$$

First we establish some basic properties of positive definite matrices:

Proposition (conj. pos. def.) If $A \in \mathbb{R}^{n \times n}$ is positive definite and $V \in \mathbb{R}^{n \times n}$ is non-singular then

$$V^\top A V$$

is positive definite.

Proposition (diag positivity) If An×n is positive definite then its diagonal entries are positive: akk>0.

Theorem (subslice pos. def.) If An×n is positive definite and 𝐤{1,,n}m is a vector of m integers where any integer appears only once, then A[𝐤,𝐤]m×m is also positive definite.

 

We leave the proofs to the problem sheets. Here is the key result:

Theorem (Cholesky and sym. pos. def.) A matrix A is symmetric positive definite if and only if it has a Cholesky decomposition

$$A = L L^\top$$

where the diagonals of L are positive.

Proof If $A$ has a Cholesky decomposition it is symmetric ($A^\top = (L L^\top)^\top = A$) and for $\mathbf{x} \neq 0$ we have

$$\mathbf{x}^\top A \mathbf{x} = (L^\top \mathbf{x})^\top L^\top \mathbf{x} = \|L^\top \mathbf{x}\|^2 > 0$$

where we use the fact that L is non-singular.

For the other direction we will prove it by induction, with the 1×1 case being trivial. Write

$$A = \begin{bmatrix} \alpha & \mathbf{v}^\top \\ \mathbf{v} & K \end{bmatrix} = \underbrace{\begin{bmatrix} \sqrt{\alpha} & \\ {\mathbf{v} \over \sqrt{\alpha}} & I \end{bmatrix}}_{L_1} \underbrace{\begin{bmatrix} 1 & \\ & K - {\mathbf{v} \mathbf{v}^\top \over \alpha} \end{bmatrix}}_{A_1} \underbrace{\begin{bmatrix} \sqrt{\alpha} & {\mathbf{v}^\top \over \sqrt{\alpha}} \\ & I \end{bmatrix}}_{L_1^\top}.$$

Note that $K - {\mathbf{v} \mathbf{v}^\top \over \alpha}$ is a subslice of $A_1 = L_1^{-1} A L_1^{-\top}$, hence by the previous propositions is itself symmetric positive definite. Thus we can write

$$K - {\mathbf{v} \mathbf{v}^\top \over \alpha} = \tilde{L} \tilde{L}^\top$$

and hence

$$L = L_1 \begin{bmatrix} 1 & \\ & \tilde{L} \end{bmatrix}$$

satisfies $A = L L^\top$.

Note hidden in this proof is a simple algorithm for computing the Cholesky decomposition. We define

$$\begin{aligned}
A_1 &:= A \\
\mathbf{v}_k &:= A_k[2{:}n-k+1, 1] \\
\alpha_k &:= A_k[1, 1] \\
A_{k+1} &:= A_k[2{:}n-k+1, 2{:}n-k+1] - {\mathbf{v}_k \mathbf{v}_k^\top \over \alpha_k}.
\end{aligned}$$

Then

$$L = \begin{bmatrix} \sqrt{\alpha_1} & & & & \\ {\mathbf{v}_1[1] \over \sqrt{\alpha_1}} & \sqrt{\alpha_2} & & & \\ {\mathbf{v}_1[2] \over \sqrt{\alpha_1}} & {\mathbf{v}_2[1] \over \sqrt{\alpha_2}} & \ddots & & \\ \vdots & \vdots & \ddots & \ddots & \\ {\mathbf{v}_1[n-1] \over \sqrt{\alpha_1}} & {\mathbf{v}_2[n-2] \over \sqrt{\alpha_2}} & \cdots & {\mathbf{v}_{n-1}[1] \over \sqrt{\alpha_{n-1}}} & \sqrt{\alpha_n} \end{bmatrix}$$

This algorithm succeeds if and only if A is symmetric positive definite.
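A sketch of this algorithm (Python; the function name is illustrative, and the final check verifies $L L^\top = A$):

```python
# Cholesky by the recursion in the text: peel off alpha = S[0][0] and
# v = S[1:,0], then update the trailing block by v v^T / alpha.
import math

def cholesky(A):
    n = len(A)
    S = [row[:] for row in A]        # working copy of A_k
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        alpha = S[k][k]
        assert alpha > 0, "matrix is not positive definite"
        L[k][k] = math.sqrt(alpha)
        for i in range(k + 1, n):
            L[i][k] = S[i][k] / math.sqrt(alpha)
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                S[i][j] -= S[i][k] * S[k][j] / alpha
    return L

A = [[2.0, 1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0, 1.0],
     [1.0, 1.0, 2.0, 1.0],
     [1.0, 1.0, 1.0, 2.0]]
L = cholesky(A)
err = max(abs(sum(L[i][k] * L[j][k] for k in range(4)) - A[i][j])
          for i in range(4) for j in range(4))
print(err)   # ~ 0.0: L L^T reproduces A
```

The assert makes the if-and-only-if concrete: the recursion fails precisely when some pivot $\alpha_k$ is not positive.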

Example Consider the matrix

$$A_0 = A = \begin{bmatrix} 2 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}$$

Then

$$A_1 = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix} - {1 \over 2} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \end{bmatrix} = {1 \over 2} \begin{bmatrix} 3 & 1 & 1 \\ 1 & 3 & 1 \\ 1 & 1 & 3 \end{bmatrix}$$

Continuing, we have

$$A_2 = {1 \over 2} \left( \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix} - {1 \over 3} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \end{bmatrix} \right) = {1 \over 3} \begin{bmatrix} 4 & 1 \\ 1 & 4 \end{bmatrix}$$

Finally

$$A_3 = {5 \over 4}.$$

Thus we get

$$L = L_1 L_2 L_3 = \begin{bmatrix} \sqrt{2} & & & \\ {1 \over \sqrt{2}} & \sqrt{3 \over 2} & & \\ {1 \over \sqrt{2}} & {1 \over \sqrt{6}} & {2 \over \sqrt{3}} & \\ {1 \over \sqrt{2}} & {1 \over \sqrt{6}} & {1 \over 2\sqrt{3}} & {\sqrt{5} \over 2} \end{bmatrix}$$

4. Timings

The different decompositions have trade-offs between speed and stability. First we compare the speed of the different decompositions on a symmetric positive definite matrix, from fastest to slowest:

On my machine, cholesky is ~1.5x faster than lu, which is ~2x faster than QR.

 

In terms of stability, QR computed with Householder reflections (and Cholesky for positive definite matrices) are stable, whereas LU is usually unstable (unless the matrix is diagonally dominant). PLU is a very complicated story: in theory it is unstable, but the set of matrices for which it is unstable is extremely small, so small one does not normally run into them.

Here is an example matrix that is in this set.

Note that pivoting will not occur (we do not pivot as the entries below the diagonal are the same magnitude as the diagonal), thus the PLU Decomposition is equivalent to an LU decomposition:

But here we see an issue: the last column of U is growing exponentially fast! Thus when n is large we get very large errors:

Note qr is completely fine:

Amazingly, PLU is fine if applied to a small perturbation of A:

 

The big open problem in numerical linear algebra is to prove that the set of matrices for which PLU fails has extremely small measure.


Week 5: Singular values and conditioning

YouTube Lectures: Matrix Norms Singular Value Decomposition

In this lecture we discuss matrix and vector norms. The matrix 2-norm involves singular values, which are a measure of how matrices "stretch" vectors. We also show that the singular values of a matrix give a notion of a condition number, which allows us to bound errors introduced by floating point numbers in linear algebra operations.

  1. Vector norms: we discuss the standard $p$-norm for vectors in $\mathbb{R}^n$.
  2. Matrix norms: we discuss how two vector norms can be used to induce a norm on matrices. These satisfy an additional multiplicative inequality.
  3. Singular value decomposition: we introduce the singular value decomposition which is related to the matrix 2-norm and best low rank approximation.
  4. Condition numbers: we will see how errors in matrix-vector multiplication and solving linear systems can be bounded in terms of the condition number, which is defined in terms of singular values.

1. Vector norms

Recall the definition of a (vector-)norm:

Definition (vector-norm) A norm on ℝⁿ is a function ‖⋅‖ : ℝⁿ → ℝ that satisfies the following, for 𝐱, 𝐲 ∈ ℝⁿ and c ∈ ℝ:

  1. Triangle inequality: ‖𝐱 + 𝐲‖ ≤ ‖𝐱‖ + ‖𝐲‖
  2. Homogeneity: ‖c𝐱‖ = |c|‖𝐱‖
  3. Positive-definiteness: ‖𝐱‖ = 0 implies that 𝐱 = 𝟎.

Consider the following example:

Definition (p-norm) For 1 ≤ p < ∞ and 𝐱 ∈ ℝⁿ, define the p-norm:

‖𝐱‖_p := (∑_{k=1}^n |x_k|^p)^{1/p}

where x_k is the k-th entry of 𝐱. For p = ∞ we define

‖𝐱‖_∞ := max_k |x_k|

Theorem (p-norm) ‖⋅‖_p is a norm for 1 ≤ p ≤ ∞.

Proof

We will only prove the cases p = 1, 2, ∞, as general p is more involved.

Homogeneity and positive-definiteness are straightforward: e.g.,

‖c𝐱‖_p = (∑_{k=1}^n |cx_k|^p)^{1/p} = (|c|^p ∑_{k=1}^n |x_k|^p)^{1/p} = |c|‖𝐱‖_p

and if ‖𝐱‖_p = 0 then all |x_k|^p have to be zero.

For p = ∞ and p = 1 the triangle inequality is also straightforward:

‖𝐱 + 𝐲‖_∞ = max_k |x_k + y_k| ≤ max_k (|x_k| + |y_k|) ≤ ‖𝐱‖_∞ + ‖𝐲‖_∞

and

‖𝐱 + 𝐲‖_1 = ∑_{k=1}^n |x_k + y_k| ≤ ∑_{k=1}^n (|x_k| + |y_k|) = ‖𝐱‖_1 + ‖𝐲‖_1

For p = 2 it can be proved using the Cauchy–Schwarz inequality:

|𝐱 ⋅ 𝐲| ≤ ‖𝐱‖_2 ‖𝐲‖_2

That is, we have

‖𝐱 + 𝐲‖_2² = ‖𝐱‖_2² + 2𝐱 ⋅ 𝐲 + ‖𝐲‖_2² ≤ ‖𝐱‖_2² + 2‖𝐱‖_2‖𝐲‖_2 + ‖𝐲‖_2² = (‖𝐱‖_2 + ‖𝐲‖_2)² ∎

In Julia, one can use the inbuilt norm function to calculate norms:
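The notes' Julia snippet is omitted above; as an illustrative sketch (not the original code), the same norms can be computed in Python with `numpy.linalg.norm`:

```python
import numpy as np

# Vector p-norms, mirroring the definitions above.
x = np.array([3.0, -4.0])

one_norm = np.linalg.norm(x, 1)       # |3| + |-4| = 7
two_norm = np.linalg.norm(x, 2)       # sqrt(3^2 + 4^2) = 5
inf_norm = np.linalg.norm(x, np.inf)  # max(|3|, |-4|) = 4
print(one_norm, two_norm, inf_norm)
```

In Julia the analogous calls are `norm(x, 1)`, `norm(x)` and `norm(x, Inf)`.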

2. Matrix norms

Just like vectors, matrices have norms that measure their "length". The simplest example is the Frobenius norm, defined for an m × n real matrix A as

‖A‖_F := √(∑_{k=1}^m ∑_{j=1}^n A_{kj}²)

This is available as norm in Julia:

While this is the simplest norm, it is not the most useful. Instead, we will build a matrix norm from a vector norm:

 

Definition (matrix-norm) Suppose A ∈ ℝ^{m×n} and consider two norms ‖⋅‖_X on ℝⁿ and ‖⋅‖_Y on ℝᵐ. Define the (induced) matrix norm as:

‖A‖_{X→Y} := sup_{𝐯 : ‖𝐯‖_X = 1} ‖A𝐯‖_Y

Also define

‖A‖_X := ‖A‖_{X→X}

For the induced 2, 1, and ∞-norms we use

‖A‖_2, ‖A‖_1 and ‖A‖_∞.

Note an equivalent definition of the induced norm:

‖A‖_{X→Y} = sup_{𝐱 ∈ ℝⁿ, 𝐱 ≠ 𝟎} ‖A𝐱‖_Y / ‖𝐱‖_X

This follows since we can scale 𝐱 by its norm so that it has unit norm, that is, 𝐱/‖𝐱‖_X has unit norm.

Lemma (matrix norms are norms) Induced matrix norms are norms, that is, writing ‖⋅‖ = ‖⋅‖_{X→Y} we have:

  1. Triangle inequality: ‖A + B‖ ≤ ‖A‖ + ‖B‖
  2. Homogeneity: ‖cA‖ = |c|‖A‖
  3. Positive-definiteness: ‖A‖ = 0 ⇒ A = 0

In addition, they satisfy the following additional properties:

  1. ‖A𝐱‖_Y ≤ ‖A‖_{X→Y}‖𝐱‖_X
  2. Multiplicative inequality: ‖AB‖_{X→Z} ≤ ‖A‖_{Y→Z}‖B‖_{X→Y}

Proof

First we show the triangle inequality:

‖A + B‖ ≤ sup_{𝐯 : ‖𝐯‖_X = 1} (‖A𝐯‖_Y + ‖B𝐯‖_Y) ≤ ‖A‖ + ‖B‖.

Homogeneity is also immediate. Positive-definiteness follows from the fact that if A=0 then A𝐱=0 for all 𝐱n. The property A𝐱YAXY𝐱X follows from the definition. Finally, the multiplicative inequality follows from

‖AB‖_{X→Z} = sup_{𝐯 : ‖𝐯‖_X = 1} ‖AB𝐯‖_Z ≤ sup_{𝐯 : ‖𝐯‖_X = 1} ‖A‖_{Y→Z}‖B𝐯‖_Y = ‖A‖_{Y→Z}‖B‖_{X→Y} ∎

 

We have some simple examples of induced norms:

Example (1-norm) We claim

‖A‖_1 = max_j ‖𝐚_j‖_1

that is, the maximum 1-norm of the columns. To see this, use the triangle inequality to find, for ‖𝐱‖_1 = 1,

‖A𝐱‖_1 ≤ ∑_{j=1}^n |x_j| ‖𝐚_j‖_1 ≤ max_j ‖𝐚_j‖_1 ∑_{j=1}^n |x_j| = max_j ‖𝐚_j‖_1.

But the bound is also attained, since if j is the column that maximises the norms then

‖A𝐞_j‖_1 = ‖𝐚_j‖_1 = max_j ‖𝐚_j‖_1.

In the problem sheet we see that

‖A‖_∞ = max_k ‖A[k,:]‖_1

that is, the maximum 1-norm of the rows.

Matrix norms are available via opnorm:
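The notes use Julia's `opnorm`; as a hedged NumPy sketch (not the notes' code), induced matrix norms are available via `numpy.linalg.norm` with a matrix argument:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

# Induced norms: ord=1 is the maximum column 1-norm, ord=inf the maximum
# row 1-norm, and ord=2 the largest singular value.
norm1 = np.linalg.norm(A, 1)
norminf = np.linalg.norm(A, np.inf)
norm2 = np.linalg.norm(A, 2)
sigma = np.linalg.svd(A, compute_uv=False)
print(norm1, norminf, norm2, sigma[0])
```

Note that `norm2` coincides with the first singular value, anticipating the next section.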

An example that does not have a simple formula is A2, but we do have two simple cases:

Proposition (diagonal/orthogonal 2-norms) If Λ is diagonal with entries λ_k then ‖Λ‖_2 = max_k |λ_k|. If Q is orthogonal then ‖Q‖_2 = 1.

3. Singular value decomposition

To define the induced 2-norm we need to consider the following:

Definition (singular value decomposition) For A ∈ ℝ^{m×n} with rank r > 0, the reduced singular value decomposition (SVD) is

A = UΣVᵀ

where U ∈ ℝ^{m×r} and V ∈ ℝ^{n×r} have orthonormal columns and Σ ∈ ℝ^{r×r} is diagonal, whose diagonal entries, which we call singular values, are all positive and decreasing: σ_1 ≥ ⋯ ≥ σ_r > 0. The full singular value decomposition (SVD) is

A = ŨΣ̃Ṽᵀ

where Ũ ∈ ℝ^{m×m} and Ṽ ∈ ℝ^{n×n} are orthogonal matrices and Σ̃ ∈ ℝ^{m×n} has only diagonal entries, i.e., if m > n,

Σ̃ = [diag(σ_1, …, σ_n); 𝟎]   (the n singular values on the diagonal, with m − n zero rows below)

and if m < n,

Σ̃ = [diag(σ_1, …, σ_m) | 𝟎]   (m singular values, with n − m zero columns to the right)

where σ_k = 0 if k > r.

To show the SVD exists we first establish some properties of the Gram matrix AᵀA:

Proposition (Gram matrix kernel) The kernel of A is also the kernel of AᵀA.

Proof If AᵀA𝐱 = 𝟎 then we have

0 = 𝐱ᵀAᵀA𝐱 = ‖A𝐱‖_2²

which means A𝐱 = 𝟎 and 𝐱 ∈ ker(A). The converse inclusion ker(A) ⊆ ker(AᵀA) is immediate. ∎

Proposition (Gram matrix diagonalisation) The Gram matrix satisfies

AᵀA = QΛQᵀ

where Q is orthogonal and the eigenvalues λ_k are non-negative.

Proof AᵀA is symmetric so we appeal to the spectral theorem for the existence of the decomposition. For the corresponding (orthonormal) eigenvector 𝐪_k,

λ_k = λ_k 𝐪_kᵀ𝐪_k = 𝐪_kᵀAᵀA𝐪_k = ‖A𝐪_k‖_2² ≥ 0. ∎

This connection allows us to prove existence:

Theorem (SVD existence) Every A ∈ ℝ^{m×n} has an SVD.

Proof Consider

AᵀA = QΛQᵀ.

Assume (as usual) that the eigenvalues are sorted in decreasing order, so that λ_1, …, λ_r are an enumeration of the non-zero eigenvalues, and let

V := [𝐪_1 | ⋯ | 𝐪_r]

be the corresponding (orthonormal) eigenvectors, with

K := [𝐪_{r+1} | ⋯ | 𝐪_n]

spanning the corresponding kernel. Define

Σ := diag(√λ_1, …, √λ_r)

Now define

U := AVΣ⁻¹

which has orthonormal columns since AᵀAV = VΣ²:

UᵀU = Σ⁻¹VᵀAᵀAVΣ⁻¹ = Σ⁻¹Σ²Σ⁻¹ = I.

Thus we have

UΣVᵀ = AVVᵀ = A[V | K][V | K]ᵀ = AQQᵀ = A

where we use the fact that AK = 0, so that concatenating K does not change the value, and that [V | K] = Q (up to ordering of the columns) is orthogonal. ∎

Singular values tell us the 2-norm:

Corollary (singular values and norm)

‖A‖_2 = σ_1

and if A ∈ ℝ^{n×n} is invertible, then

‖A⁻¹‖_2 = σ_n⁻¹

Proof

First we establish the upper bound:

‖A‖_2 ≤ ‖U‖_2 ‖Σ‖_2 ‖Vᵀ‖_2 = ‖Σ‖_2 = σ_1

This is attained using the first right singular vector:

‖A𝐯_1‖_2 = ‖UΣVᵀ𝐯_1‖_2 = ‖Σ𝐞_1‖_2 = σ_1

The inverse result follows since

A⁻¹ = VΣ⁻¹Uᵀ = (VW)(WΣ⁻¹W)(UW)ᵀ

is the SVD of A⁻¹ (with the singular values now in decreasing order, the largest being σ_n⁻¹), where

W := P_σ

is the anti-diagonal permutation matrix that reverses the entries, that is, σ has Cauchy notation

(1 2 ⋯ n−1 n
 n n−1 ⋯ 2 1). ∎

We will not discuss in this module computation of singular value decompositions or eigenvalues: they involve iterative algorithms (actually built on a sequence of QR decompositions).

One of the main usages for SVDs is low-rank approximation:

Theorem (best low rank approximation) The matrix

A_k := [𝐮_1 | ⋯ | 𝐮_k] diag(σ_1, …, σ_k) [𝐯_1 | ⋯ | 𝐯_k]ᵀ

is the best 2-norm approximation of A by a rank k matrix, that is, for all rank-k matrices B we have ‖A − A_k‖_2 ≤ ‖A − B‖_2.

Proof We have

A − A_k = U diag(0, …, 0, σ_{k+1}, …, σ_r) Vᵀ.

Suppose a rank-k matrix B has

‖A − B‖_2 < ‖A − A_k‖_2 = σ_{k+1}.

For all 𝐰 ∈ ker(B) we have

‖A𝐰‖_2 = ‖(A − B)𝐰‖_2 ≤ ‖A − B‖_2 ‖𝐰‖_2 < σ_{k+1}‖𝐰‖_2

But for all 𝐮 ∈ span(𝐯_1, …, 𝐯_{k+1}), that is, 𝐮 = V[:,1:k+1]𝐜 for some 𝐜 ∈ ℝ^{k+1}, we have

‖A𝐮‖_2² = ‖U diag(σ_1, …, σ_{k+1})𝐜‖_2² = ∑_{j=1}^{k+1} (σ_j c_j)² ≥ σ_{k+1}² ‖𝐜‖²,

i.e., ‖A𝐮‖_2 ≥ σ_{k+1}‖𝐜‖ = σ_{k+1}‖𝐮‖_2. Thus 𝐰 cannot be in this span.

The dimension of ker(B) is at least n − k, while the dimension of span(𝐯_1, …, 𝐯_{k+1}) is k + 1. Since these two spaces can only intersect at zero we have a contradiction, as (n − k) + (k + 1) = n + 1 > n. ∎

 

Here we show an example of a simple low-rank approximation using the SVD. Consider the Hilbert matrix:

That is, H[k,j] = 1/(k+j−1). This is a famous example of a matrix with rapidly decreasing singular values:

Note that numerically we typically do not get exactly zero singular values, so the rank is always treated as min(m,n). Because the singular values decay rapidly we can approximate the matrix very well with a rank-20 matrix:

Note that this can be viewed as a compression algorithm: we have replaced a matrix with 100² = 10,000 entries by two matrices and a vector with roughly 4,000 entries, without losing any information beyond roundoff. In the problem sheet we explore the usage of low rank approximation of smooth functions.
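The Julia experiment itself is omitted above; a hypothetical NumPy sketch of the same idea (Hilbert matrix, rank-20 truncation via the SVD) is:

```python
import numpy as np

n = 100
k = np.arange(1, n + 1)
H = 1.0 / (k[:, None] + k[None, :] - 1)   # Hilbert matrix H[k,j] = 1/(k+j-1)

U, s, Vt = np.linalg.svd(H)
r = 20
Hr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # rank-20 truncated SVD

# By the best low rank approximation theorem, the 2-norm error is sigma_{r+1},
# which for the Hilbert matrix is tiny.
err = np.linalg.norm(H - Hr, 2)
print(err)
```

The storage is `U[:, :r]` (2,000 entries), `Vt[:r, :]` (2,000 entries) and `s[:r]` (20 entries), matching the count in the text.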

 

4. Condition numbers

We have seen that floating point arithmetic induces errors in computations, and that we can typically bound the absolute errors to be proportional to Cϵm. We want a way to bound the effect of more complicated calculations like computing A𝐱 or A1𝐲 without having to deal with the exact nature of floating point arithmetic. Here we consider only matrix-multiplication but will make a remark about matrix inversion.

To justify what follows, we first observe that errors in implementing matrix-vector multiplication can be captured by considering the multiplication to be exact on the wrong matrix: that is, A*x (implemented with floating point) is precisely (A + δA)𝐱 where δA has small norm relative to A. This is known as backward error analysis.

 

To discuss floating point errors we need to be precise about the order in which the operations happen. We will write mul(A, 𝐱) for the floating point implementation of A𝐱. (Note that mul_rows actually does the exact same operations, just in a different order.) Each entry of the result is a dot product of a row of A with 𝐱, so we first consider the error in the dot product 𝐱 ⋅ 𝐲 as implemented in floating point, which we denote dot(𝐱, 𝐲).

We first need a helper proposition:

Proposition If |ϵ_k| ≤ ϵ and nϵ < 1, then

∏_{k=1}^n (1 + ϵ_k) = 1 + θ_n

for some constant θ_n satisfying |θ_n| ≤ nϵ/(1 − nϵ).

The proof is left as an exercise (Hint: use induction).

Lemma (dot product backward error) For 𝐱, 𝐲 ∈ ℝⁿ,

dot(𝐱, 𝐲) = (𝐱 + δ𝐱)ᵀ𝐲

where, entrywise,

|δ𝐱| ≤ (nϵ_m/(2 − nϵ_m)) |𝐱|,

where |𝐱| means the absolute value of each entry.

Proof

Note that

dot(𝐱, 𝐲) = {⋯[(x_1 ⊗ y_1) ⊕ (x_2 ⊗ y_2)] ⊕ (x_3 ⊗ y_3) ⋯} ⊕ (x_n ⊗ y_n)
= {⋯[x_1y_1(1 + δ_1) + x_2y_2(1 + δ_2)](1 + γ_2) + x_3y_3(1 + δ_3)](1 + γ_3) + ⋯ + x_ny_n(1 + δ_n)}(1 + γ_n)
= ∑_{j=1}^n x_jy_j(1 + δ_j) ∏_{k=j}^n (1 + γ_k) = ∑_{j=1}^n x_jy_j(1 + θ_j)

where we denote the errors from multiplication by δ_k and those from addition by γ_k (with the convention γ_1 = 0). Each θ_j consists of at most n error terms, each bounded by ϵ_m/2, thus the previous proposition tells us

|θ_j| ≤ nϵ_m/(2 − nϵ_m).

Thus

δ𝐱 = (x_1θ_1, x_2θ_2, …, x_nθ_n)ᵀ

and the entrywise bound follows; taking norms, homogeneity gives ‖δ𝐱‖ ≤ (nϵ_m/(2 − nϵ_m))‖𝐱‖. ∎

Theorem (matrix-vector backward error) For A ∈ ℝ^{m×n} and 𝐱 ∈ ℝⁿ we have

mul(A, 𝐱) = (A + δA)𝐱

where, entrywise,

|δA| ≤ (nϵ_m/(2 − nϵ_m)) |A|.

Therefore

‖δA‖_1 ≤ (nϵ_m/(2 − nϵ_m)) ‖A‖_1
‖δA‖_2 ≤ √(min(m,n)) (nϵ_m/(2 − nϵ_m)) ‖A‖_2
‖δA‖_∞ ≤ (nϵ_m/(2 − nϵ_m)) ‖A‖_∞

Proof The bound on |δA| is implied by applying the previous lemma to each row. The 1- and ∞-norm bounds follow since

‖A‖_1 = ‖|A|‖_1 and ‖A‖_∞ = ‖|A|‖_∞

This leaves the 2-norm, which is a bit more challenging as there are matrices A for which ‖A‖_2 ≠ ‖|A|‖_2. Instead we will prove the result by going through the Frobenius norm, using:

‖A‖_2 ≤ ‖A‖_F ≤ √r ‖A‖_2

where r is the rank of A (see PS5) and ‖A‖_F = ‖|A|‖_F, so we deduce:

‖δA‖_2 ≤ ‖δA‖_F = ‖|δA|‖_F ≤ (nϵ_m/(2 − nϵ_m)) ‖|A|‖_F = (nϵ_m/(2 − nϵ_m)) ‖A‖_F ≤ √r (nϵ_m/(2 − nϵ_m)) ‖A‖_2 ≤ √(min(m,n)) (nϵ_m/(2 − nϵ_m)) ‖A‖_2 ∎

So now we get to a mathematical question independent of floating point: can we bound the relative error in approximating

A𝐱 ≈ (A + δA)𝐱

if we know a bound on ‖δA‖? It turns out we can, in terms of the condition number of the matrix:

Definition (condition number) For a square matrix A, the condition number (in p-norm) is

κ_p(A) := ‖A‖_p ‖A⁻¹‖_p

with the special case in the 2-norm:

κ_2(A) = σ_1/σ_n.

Theorem (relative-error for matrix-vector) The worst-case relative error in A𝐱 ≈ (A + δA)𝐱 is

‖δA𝐱‖/‖A𝐱‖ ≤ κ(A)ε

if we have the relative perturbation error ‖δA‖ = ‖A‖ε.

Proof We can assume A is invertible (as otherwise κ(A) = ∞). Denote 𝐲 = A𝐱; then we have

‖𝐱‖/‖A𝐱‖ = ‖A⁻¹𝐲‖/‖𝐲‖ ≤ ‖A⁻¹‖

Thus we have:

‖δA𝐱‖/‖A𝐱‖ ≤ ‖δA‖‖𝐱‖/‖A𝐱‖ ≤ ‖δA‖‖A⁻¹‖ = κ(A)‖δA‖/‖A‖ ∎

Thus for floating point arithmetic we know the error is bounded by κ(A) nϵ_m/(2 − nϵ_m).

If one uses QR to solve A𝐱=𝐲 the condition number also gives a meaningful bound on the error. As we have already noted, there are some matrices where PLU decompositions introduce large errors, so in that case well-conditioning is not a guarantee (but it still usually works).
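As an illustrative check (a NumPy sketch, not from the notes), κ_2 can be computed directly from the singular values and agrees with the library's built-in condition number:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# kappa_2(A) = sigma_1 / sigma_n, computed from the SVD.
s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[-1]
print(kappa)
```

In Julia the analogous call is `cond(A)` from LinearAlgebra.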


Week 6: Differential Equations via Finite Differences

YouTube Lectures: Condition Numbers Indefinite integration via Finite Differences Euler Methods Poisson Equation Convergence of Finite Differences

We now see our first application: solving differential equations. We will focus on the following differential equations:

  1. Indefinite integration for a ≤ x ≤ b:
u(a) = c, u′ = f(x)
  2. Linear time-evolution problems for 0 ≤ t ≤ T:
u(0) = c, u′ − a(t)u = f(t)
  3. Vector time-evolution problems for 0 ≤ t ≤ T:
𝐮(0) = 𝐜, 𝐮′ − A(t)𝐮 = 𝐟(t)
  4. Nonlinear time-evolution problems for 0 ≤ t ≤ T:
𝐮(0) = 𝐜, 𝐮′ = f(t, 𝐮(t))
  5. Poisson equation with Dirichlet conditions for a ≤ x ≤ b:
u(a) = c₀, u(b) = c₁, u″ = f(x)

Our approach to solving these is to

  1. Approximate the solution on [a,b] evaluated on an n-point grid x_1, …, x_n (labelled t_k for time-evolution problems) with step size

x_{k+1} − x_k = h = (b − a)/(n − 1)

for a vector 𝐮 ∈ ℝⁿ that will be determined by solving a linear system:

[u(x_1), …, u(x_n)]ᵀ ≈ [u_1, …, u_n]ᵀ = 𝐮

  2. Replace the derivatives with the finite-difference approximations (here m_k = (x_k + x_{k+1})/2 is the mid-point of the grid):

u′(x_k) ≈ (u_{k+1} − u_k)/h   (Forward-difference)
u′(m_k) ≈ (u_{k+1} − u_k)/h   (Central-difference)
u′(x_k) ≈ (u_k − u_{k−1})/h   (Backward-difference)
u″(x_k) ≈ (u_{k+1} − 2u_k + u_{k−1})/h²

  3. Recast the differential equation as a linear system whose solution is 𝐮, which we solve using numerical linear algebra. Add the initial/boundary conditions as extra rows to make the system square.

Remark: (advanced) One should normally not need to implement these methods oneself as there are packages available, e.g. DifferentialEquations.jl. Moreover, Forward and Backward Euler are only the first baby steps towards a wide range of time-steppers, with Runge–Kutta methods being among the most successful. For example, a simple differential equation like the pendulum u″ = −sin u can be solved as follows (rewriting it as the system u′ = v, v′ = −sin u):

However, even in these automated packages one has a choice of different methods with different behaviour, so it is important to understand what is happening.

1. Time-evolution problems

In this section we consider the forward and backward Euler methods, which are based on forward and backward difference approximations to the derivative. In the problem sheet we will investigate a rule that takes the average of the two (with significant benefits). We first discuss the simplest case of indefinite integration, where central differences are also applicable, then introduce forward and backward Euler for linear scalar, linear vector, and nonlinear differential equations.

Indefinite integration

We begin with the simplest differential equation on an interval [a,b]:

u(a) = c, u′(x) = f(x)

Using the forward-difference (which is the standard finite-difference) approximation we choose u_k ≈ u(x_k) so that, for k = 1, …, n−1:

u′(x_k) ≈ (u_{k+1} − u_k)/h = f(x_k)

That is, where u satisfies the differential equation exactly, u_k satisfies the difference equation exactly. We do not include k = n to avoid going outside our grid.

This condition can be recast as a linear system:

D_h 𝐮ᶠ = [f(x_1), …, f(x_{n−1})]ᵀ =: 𝐟ᶠ, where D_h := (1/h) [−1 1; −1 1; ⋱ ⋱; −1 1] ∈ ℝ^{(n−1)×n}

and the superscript f denotes that we are using forward differences (and is leading towards the forward Euler method). Since D_h is not square this system is not invertible. Thus we need to add an extra row, coming from the initial condition 𝐞_1ᵀ𝐮ᶠ = c, that is:

[𝐞_1ᵀ; D_h] 𝐮ᶠ = [1; −1/h 1/h; −1/h 1/h; ⋱ ⋱; −1/h 1/h] 𝐮ᶠ = [c; 𝐟ᶠ]

where we denote the square matrix by L_h. This is a lower-triangular bidiagonal system, so it can be solved using forward substitution in O(n) operations.

We can also consider discretisation at the mid-point m_k = (x_k + x_{k+1})/2, which is the analogue of using central differences:

u′(m_k) ≈ (u_{k+1} − u_k)/h = f(m_k)

That is, we have the exact same system with a different right-hand side:

D_h 𝐮ᵐ = [f(m_1), …, f(m_{n−1})]ᵀ =: 𝐟ᵐ

And of course there is 𝐮ᵇ coming from the backward-difference formula:

u′(x_k) ≈ (u_k − u_{k−1})/h = f(x_k)

which we leave as an exercise.

Example

Let's do an example of integrating cosx, and see if our method matches the true answer of sinx. First we construct the system as a lower-triangular, Bidiagonal matrix:

We can now solve for our particular problem using both the left and mid-point rules:

They are both close, though the mid-point version is significantly more accurate. We can estimate how fast it converges:

This is a log-log plot: we scale both x and y axes logarithmically so that n^α becomes a straight line whose slope is dictated by α. We see experimentally that the error for forward-differences is O(n⁻¹) while for mid-point/central-differences we get faster O(n⁻²) convergence. Both methods appear to be stable.
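The Julia experiment itself is omitted above; a minimal Python sketch of the same test (integrating cos with forward and mid-point right-hand sides, assuming u(0) = 0) is:

```python
import numpy as np

# Integrate u' = cos(x) on [0, 1] with u(0) = 0; exact answer u = sin(x).
n = 1000
x = np.linspace(0, 1, n)
h = x[1] - x[0]

uf = np.zeros(n)  # forward differences: right-hand side at x_k
um = np.zeros(n)  # same bidiagonal system, right-hand side at the mid-points
for k in range(n - 1):
    uf[k + 1] = uf[k] + h * np.cos(x[k])
    um[k + 1] = um[k] + h * np.cos((x[k] + x[k + 1]) / 2)

err_f = abs(uf[-1] - np.sin(1.0))   # O(n^-1)
err_m = abs(um[-1] - np.sin(1.0))   # O(n^-2)
print(err_f, err_m)
```

The loop is exactly forward substitution on the bidiagonal system L_h described above.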

Forward Euler

Now consider a scalar linear time-evolution problem for 0tT:

u(0) = c, u′(t) − a(t)u(t) = f(t)

Label the n-point grid as t_k = (k−1)h for h = T/(n−1).

Definition (Restriction matrices) Define the (n−1) × n restriction matrices

Iₙᶠ := [I_{n−1} | 𝟎] and Iₙᵇ := [𝟎 | I_{n−1}]

which drop the last and the first entry of a vector, respectively.

Again we can replace the derivative using finite differences, giving us

(u_{k+1} − u_k)/h − a(t_k)u_k = f(t_k)

for k = 1, …, n−1. We need to incorporate the term −a(t_k)u_k into our linear system. We do this using the (n−1) × n (left) restriction matrix Iₙᶠ, which takes a vector evaluated at t_1, …, t_n and restricts it to t_1, …, t_{n−1}, as well as the n × n multiplication matrix

A_n = diag(a(t_1), …, a(t_n))

Putting everything together we have the system:

[𝐞_1ᵀ; D_h − IₙᶠA_n] 𝐮ᶠ = [1; −1/h − a(t_1), 1/h; ⋱ ⋱; −1/h − a(t_{n−1}), 1/h] 𝐮ᶠ = [c; Iₙᶠ𝐟]

where 𝐟 = [f(t_1), …, f(t_n)]ᵀ and we call the square matrix on the left L.

Here is a simple example for solving:

u(0) = 1, u′ + tu = e^t

which has an exact solution in terms of a special error function (which we determined using Mathematica).

We see that it is converging to the true result.

Note that this is a simple forward substitution of a bidiagonal system, so we can also just construct the solution directly:

u_1 = c, u_{k+1} = (1 + h a(t_k)) u_k + h f(t_k)
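This recurrence can be sketched in Python (a hypothetical example with a(t) = −1, f = 0 and c = 1, so the exact solution is e^{−t}):

```python
import numpy as np

# Forward Euler via the recurrence u_{k+1} = (1 + h a(t_k)) u_k + h f(t_k)
# for u(0) = 1, u' = -u on [0, 1]; exact solution exp(-t).
n = 10_000
t = np.linspace(0, 1, n)
h = t[1] - t[0]

u = np.empty(n)
u[0] = 1.0
for k in range(n - 1):
    u[k + 1] = (1 + h * (-1.0)) * u[k]   # a(t_k) = -1, f(t_k) = 0
err = abs(u[-1] - np.exp(-1.0))
print(err)   # O(n^-1) error
```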

Remark: (advanced) Note that for constant a this can alternatively be reduced to an integral

u(t) = c e^{at} + e^{at} ∫₀ᵗ f(τ) e^{−aτ} dτ

and solved as above, but this approach is harder to generalise.

Backward Euler

In Backward Euler we replace the forward-difference with a backward-difference, that is

(u_k − u_{k−1})/h − a(t_k)u_k = f(t_k)

This leads to the system:

[𝐞_1ᵀ; D_h − IₙᵇA_n] 𝐮ᵇ = [1; −1/h, 1/h − a(t_2); ⋱ ⋱; −1/h, 1/h − a(t_n)] 𝐮ᵇ = [c; Iₙᵇ𝐟]

Again this is a bidiagonal forward substitution:

u_1 = c, (1 − h a(t_{k+1})) u_{k+1} = u_k + h f(t_{k+1})

That is,

u_{k+1} = (1 − h a(t_{k+1}))⁻¹ (u_k + h f(t_{k+1}))

 

Systems of equations

We can also solve systems, that is, equations of the form:

𝐮(0) = 𝐜, 𝐮′(t) − A(t)𝐮(t) = 𝐟(t)

where 𝐮, 𝐟 : [0,T] → ℝᵈ and A : [0,T] → ℝ^{d×d}. We again discretise at the grid t_k by approximating 𝐮(t_k) ≈ 𝐮_k ∈ ℝᵈ. This can be reduced to a block-bidiagonal system as in the scalar case, which is solved via forward substitution, though it's easier to think of it directly.

Forward Euler gives us:

𝐮_1 = 𝐜, 𝐮_{k+1} = 𝐮_k + h A(t_k)𝐮_k + h 𝐟(t_k)

That is, each time-step consists of a matrix-vector multiplication. On the other hand Backward Euler requires inverting a matrix at each time-step:

𝐮_1 = 𝐜, 𝐮_{k+1} = (I − h A(t_{k+1}))⁻¹ (𝐮_k + h 𝐟(t_{k+1}))

Example (Airy equation) Consider the (negative-time) Airy equation:

u(0) = 1, u′(0) = 0, u″ + t u = 0

We can recast it as a system by defining

𝐮(t) = [u(t); u′(t)]

which satisfies

𝐮(0) = [1; 0], 𝐮′ − [0 1; −t 0] 𝐮 = 𝟎.

It is natural to represent the time-slices 𝐮_k as columns of a matrix U = [𝐮_1 | ⋯ | 𝐮_n] ∈ ℝ^{2×n}. Thus we get:

We leave implementation of backward Euler as a simple exercise.

Example (Heat on a graph) Those who took Introduction to Applied Mathematics will recall the heat equation on a graph. Consider a simple graph of m nodes labelled 1, …, m, where node k is connected to neighbouring nodes k−1 and k+1, whereas node 1 is only connected to node 2 and node m is only connected to node m−1. The graph Laplacian corresponding to this system is the matrix:

Δ := [−1 1; 1 −2 1; ⋱ ⋱ ⋱; 1 −2 1; 1 −1]

If we denote the heat at time t at node k as u_k(t), which we turn into a vector

𝐮(t) = [u_1(t), …, u_m(t)]ᵀ

we can consider the case of a periodic forcing at the middle node k = m/2.

The heat equation on this lattice is defined as follows:

𝐮′ = Δ𝐮 + 𝐞_{m/2} cos ωt

We can employ forward and backward Euler:

Both match!

Remark: If you change the number of time-steps to be too small, for example n = 100, forward Euler blows up while backward Euler does not. This will be discussed in the problem sheet.

Remark: (advanced) Memory allocations are very expensive so in practice one should preallocate and use memory.

Nonlinear problems

Forward Euler extends naturally to nonlinear equations, including the vector case:

𝐮′ = f(t, 𝐮(t))

becomes:

𝐮_{k+1} = 𝐮_k + h f(t_k, 𝐮_k)

Here we show a simple solution of the nonlinear pendulum

u″ = −sin u

by writing 𝐮 = [u, u′]ᵀ we have:

As we see it correctly predicts the oscillatory behaviour of a pendulum, and matches the simulation using DifferentialEquations.jl above.
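The Julia implementation is omitted above; a minimal NumPy sketch of forward Euler for the pendulum system is:

```python
import numpy as np

# Forward Euler for u'' = -sin(u), written as u' = v, v' = -sin(u),
# with u(0) = 1, v(0) = 0.
n = 10_000
T = 10.0
h = T / (n - 1)

U = np.zeros((2, n))     # columns are the time-slices [u_k; v_k]
U[:, 0] = [1.0, 0.0]
for k in range(n - 1):
    u, v = U[:, k]
    U[:, k + 1] = [u + h * v, v - h * np.sin(u)]

# The angle oscillates with amplitude close to its initial value.
print(U[0].min(), U[0].max())
```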

 

2. Two-point boundary value problems

Here we will only consider one discretisation as it is symmetric:

u″(x_k) ≈ (u_{k−1} − 2u_k + u_{k+1})/h²

That is, we use the (n−1) × (n+1) matrix:

D_h² := (1/h²) [1 −2 1; 1 −2 1; ⋱ ⋱ ⋱; 1 −2 1]

Example (Poisson) Consider the Poisson equation with Dirichlet conditions:

u(0) = c_0, u″ = f(x), u(1) = c_1

which we discretise on the grid x_0, …, x_n as

u_0 = c_0, (u_{k−1} − 2u_k + u_{k+1})/h² = f(x_k) for k = 1, …, n−1, u_n = c_1

As a linear system this equation becomes:

[𝐞_1ᵀ; D_h²; 𝐞_{n+1}ᵀ] 𝐮 = [c_0, f(x_1), …, f(x_{n−1}), c_1]ᵀ

Thus we solve:

We can test convergence on u(x) = cos x², which satisfies

u(0) = 1, u(1) = cos 1, u″(x) = −4x² cos(x²) − 2 sin(x²)

We observe uniform (∞-norm) convergence:
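The notes' Julia implementation is omitted above; an equivalent NumPy sketch of the Poisson discretisation and its convergence test is:

```python
import numpy as np

# Finite-difference Poisson solve: u(0) = 1, u(1) = cos(1),
# u''(x) = f(x) with f(x) = -4x^2 cos(x^2) - 2 sin(x^2),
# whose exact solution is u(x) = cos(x^2).
n = 1000
x = np.linspace(0, 1, n)
h = x[1] - x[0]
f = -4 * x**2 * np.cos(x**2) - 2 * np.sin(x**2)

A = np.zeros((n, n))
A[0, 0] = 1.0                      # boundary row: first value is c_0
A[-1, -1] = 1.0                    # boundary row: last value is c_1
for k in range(1, n - 1):          # interior rows: second differences
    A[k, k - 1] = 1 / h**2
    A[k, k] = -2 / h**2
    A[k, k + 1] = 1 / h**2

b = f.copy()
b[0] = 1.0
b[-1] = np.cos(1.0)

u = np.linalg.solve(A, b)
err = np.max(np.abs(u - np.cos(x**2)))
print(err)   # uniform error, O(n^-2)
```

A dense solve is used here for simplicity; the Julia version exploits the tridiagonal structure for an O(n) solve.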

 

3. Convergence

We now study convergence of the approaches for the constant-coefficient case. We will use Toeplitz matrices as a tool to simplify the explanation.

Definition (Toeplitz) A Toeplitz matrix has constant diagonals: T[k,j] = a_{k−j}.

Proposition (Bidiagonal Toeplitz inverse) The inverse of the n × n bidiagonal Toeplitz matrix with 1 on the diagonal and −ℓ on the sub-diagonal is lower-triangular Toeplitz with entries ℓ^{k−j}:

[1; −ℓ 1; ⋱ ⋱; −ℓ 1]⁻¹ = [1; ℓ 1; ℓ² ℓ 1; ⋮ ⋱ ⋱ ⋱; ℓ^{n−1} ⋯ ℓ² ℓ 1]

Theorem (Forward/Backward Euler convergence) Consider the equation

u(0) = c, u′(t) + a u(t) = f(t)

Denote

𝐮 := [u(t_1), …, u(t_n)]ᵀ

Assume that u is twice-differentiable with uniformly bounded second derivative. Then the error for forward/backward Euler is

‖𝐮ᶠ − 𝐮‖_∞, ‖𝐮ᵇ − 𝐮‖_∞ = O(n⁻¹)

Proof We prove the error bound for forward Euler as backward Euler is similar. This proof consists of two stages: (1) consistency and (2) stability.

Consistency means our discretisation approximates the true equation, that is:

L𝐮 = [c; (u(t_2) − u(t_1))/h + a u(t_1); ⋮; (u(t_n) − u(t_{n−1}))/h + a u(t_{n−1})]
   = [c; u′(t_1) + a u(t_1) + u″(τ_1)h/2; ⋮; u′(t_{n−1}) + a u(t_{n−1}) + u″(τ_{n−1})h/2]
   = [c; f(t_1) + u″(τ_1)h/2; ⋮; f(t_{n−1}) + u″(τ_{n−1})h/2] = [c; Iₙᶠ𝐟] + [0; δ]

where t_k ≤ τ_k ≤ t_{k+1}, and uniform boundedness of u″ implies that ‖δ‖_∞ = O(h), or in other words ‖δ‖_1 = O(1).

Stability means the inverse does not blow up the error. We need to be a bit careful and first write, for ℓ := 1 − ha,

L = diag(1, 1/h, …, 1/h) L̃ where L̃ := [1; −ℓ 1; ⋱ ⋱; −ℓ 1]

Stability in this case is the statement that the entries of L̃⁻¹, which by the proposition above are powers of ℓ, are bounded:

|ℓ|^k ≤ (1 + |a|h)^{n−1} = O(1)

using the fact (which you have likely seen in first year) that

lim_{n→∞} (1 + |a|/n)ⁿ = e^{|a|}.

We now combine stability and consistency. We have

‖𝐮ᶠ − 𝐮‖_∞ = ‖L⁻¹(L𝐮ᶠ − L𝐮)‖_∞ = ‖L̃⁻¹ diag(1, h, …, h) [0; δ]‖_∞ ≤ h max_{k,j} |L̃⁻¹[k,j]| ‖δ‖_1 = O(h). ∎

Poisson

For 2D problems we consider Poisson. The first stage is to row-reduce to get a symmetric tridiagonal (pos. def. up to sign) matrix: we eliminate the first and last columns using the boundary rows,

[1; −1/h² 1; ⋱; 1 −1/h²; 1] [1; 1/h² −2/h² 1/h²; ⋱ ⋱ ⋱; 1/h² −2/h² 1/h²; 1] = [1 0; 0 −2/h² 1/h²; ⋱ ⋱ ⋱; 1/h² −2/h² 0; 0 1].

Considering the right-hand side and dropping the first and last rows, our equation becomes:

(1/h²) [−2 1; 1 −2 1; ⋱ ⋱ ⋱; 1 −2 1; 1 −2] [u_1, …, u_{n−1}]ᵀ = [f(x_1) − c_0/h², f(x_2), …, f(x_{n−2}), f(x_{n−1}) − c_1/h²]ᵀ =: 𝐟

where we write the tridiagonal matrix (without the 1/h² factor) as Δ.

Remark: (advanced) You may recognise Δ as a discrete Laplacian corresponding to a graph with Dirichlet conditions, as discussed in first year applied mathematics. Thus one can interpret finite-differences as approximating a continuous differential equation by a graph. This view-point extends naturally to higher-dimensional equations. In the problem sheet we also discuss Neumann series.

Theorem (Poisson convergence) Suppose that u is four-times differentiable with uniformly bounded fourth derivative. Then the finite-difference approximation to Poisson converges like O(n⁻²).

Proof

For consistency we need the Taylor series error of second-order finite differences. We have

u(x+h) = u(x) + h u′(x) + (h²/2) u″(x) + (h³/6) u‴(x) + (h⁴/24) u⁗(t₊)
u(x−h) = u(x) − h u′(x) + (h²/2) u″(x) − (h³/6) u‴(x) + (h⁴/24) u⁗(t₋)

where t₊ ∈ [x, x+h] and t₋ ∈ [x−h, x]. Thus

(u(x+h) − 2u(x) + u(x−h))/h² = u″(x) + h² (u⁗(t₊) + u⁗(t₋))/24 = u″(x) + O(h²)

(Note this is a slightly stronger result than used in the solution of PS2.) Thus we have consistency:

(Δ/h²) [u(x_1), …, u(x_{n−1})]ᵀ = 𝐟 + δ

where ‖δ‖_∞ = O(h²).

Following PS5 we deduce that it has the Cholesky-like decomposition

Δ = −UᵀU where U := [1 −1; 1 −1; ⋱ ⋱; 1 −1; 1]

We can invert:

U⁻¹ = [1 1 ⋯ 1; 1 ⋯ 1; ⋱ ⋮; 1]

Thus we have stability: h²‖Δ⁻¹‖_∞ ≤ h²‖U⁻¹‖_∞‖U⁻ᵀ‖_∞ = O(1).

Putting everything together we have, for 𝐮 = [u(x_1), …, u(x_{n−1})]ᵀ the true values and 𝐮̃ the finite-difference approximation:

‖𝐮̃ − 𝐮‖_∞ = ‖h²Δ⁻¹((Δ/h²)𝐮̃ − (Δ/h²)𝐮)‖_∞ = ‖h²Δ⁻¹δ‖_∞ ≤ h²‖Δ⁻¹‖_∞ ‖δ‖_∞ = O(n⁻²) ∎

What about the observed instability? The condition number of the matrix provides an intuition (though not a proof: condition numbers only give upper bounds!). Here we have

κ(Δ/h²) = κ(Δ) = ‖Δ‖‖Δ⁻¹‖ = O(n²)

One can show by looking at Δ⁻¹ directly that the condition number is indeed growing like n². Thus we expect floating-point errors to be magnified proportionally to n² in the linear solve.


Week 7: Fourier series

YouTube Lectures: Fourier Series Trapezium Rule and Fourier Coefficients

In Part III, Computing with Functions, we work with approximating functions by expansions in bases: that is, instead of approximating at a grid (as in the Differential Equations chapter), we approximate functions by other, simpler, functions. The most fundamental basis is (complex) Fourier series:

f(θ) = ∑_{k=−∞}^∞ f̂_k e^{ikθ}

where

f̂_k := (1/2π) ∫₀^{2π} f(θ) e^{−ikθ} dθ

In numerical analysis we try to build on the analogy with linear algebra as much as possible. Therefore we write this as:

f(θ) = [⋯ | e^{−2iθ} | e^{−iθ} | 1 | e^{iθ} | e^{2iθ} | ⋯] [⋮; f̂_{−2}; f̂_{−1}; f̂_0; f̂_1; f̂_2; ⋮] =: F(θ) 𝐟̂

where the 1 and f̂_0 mark the zero-index location of the bi-infinite row vector F(θ) and coefficient vector 𝐟̂.

More precisely, we are going to build an approximation using n approximate coefficients f̂knf̂k. We separate this into three cases:

  1. Odd: If n = 2m+1 we approximate
f(θ) ≈ ∑_{k=−m}^m f̂_kⁿ e^{ikθ} = [e^{−imθ} | ⋯ | e^{−iθ} | 1 | e^{iθ} | ⋯ | e^{imθ}] [f̂_{−m}ⁿ, …, f̂_mⁿ]ᵀ
  2. Even: If n = 2m we approximate
f(θ) ≈ ∑_{k=−m}^{m−1} f̂_kⁿ e^{ikθ} = [e^{−imθ} | ⋯ | e^{−iθ} | 1 | e^{iθ} | ⋯ | e^{i(m−1)θ}] [f̂_{−m}ⁿ, …, f̂_{m−1}ⁿ]ᵀ
  3. Taylor: if we know the negative coefficients vanish (0 = f̂_{−1} = f̂_{−2} = ⋯) we approximate
f(θ) ≈ ∑_{k=0}^{n−1} f̂_kⁿ e^{ikθ} = [1 | e^{iθ} | e^{2iθ} | ⋯ | e^{i(n−1)θ}] [f̂_0ⁿ, …, f̂_{n−1}ⁿ]ᵀ

This can be thought of as an approximate Taylor expansion, using the change of variables z = e^{iθ}.

1. Basics of Fourier series

 

In analysis one typically works with continuous functions and relates results to continuity. In numerical analysis we inherently have to work with vectors, so it is more natural to focus on the case where the Fourier coefficients f̂_k are absolutely convergent, or in other words, the 1-norm of 𝐟̂ is bounded:

‖𝐟̂‖_1 = ∑_{k=−∞}^∞ |f̂_k| < ∞

We first state a basic result (whose proof is beyond the scope of this module):

Theorem (convergence) If the Fourier coefficients are absolutely convergent then

f(θ) = ∑_{k=−∞}^∞ f̂_k e^{ikθ},

which converges uniformly.

Remark: (advanced) We also have convergence for the continuous version of the 2-norm,

‖f‖_2 := √(∫₀^{2π} |f(θ)|² dθ),

for any function such that ‖f‖_2 < ∞, but we won't need that in what follows.

Fortunately, continuity gives us sufficient (though not necessary) conditions for absolute convergence:

Proposition (differentiability and absolute convergence) If f, f′ : ℝ → ℂ are 2π-periodic and f″ is uniformly bounded, then the Fourier coefficients of f satisfy

‖𝐟̂‖_1 < ∞

Proof Integrate by parts twice, using the fact that f(0) = f(2π) and f′(0) = f′(2π), so the boundary terms vanish:

f̂_k = (1/2π) ∫₀^{2π} f(θ) e^{−ikθ} dθ = (1/2πik) ∫₀^{2π} f′(θ) e^{−ikθ} dθ = −(1/2πk²) ∫₀^{2π} f″(θ) e^{−ikθ} dθ

thus uniform boundedness of f″ guarantees |f̂_k| ≤ M|k|⁻² for some constant M, and we have

∑_{k=−∞}^∞ |f̂_k| ≤ |f̂_0| + 2M ∑_{k=1}^∞ k⁻² < ∞

by comparison with the convergent p-series. ∎

This condition can be weakened to Lipschitz continuity, but the proof is beyond the scope of this module. Of more practical importance is the other direction: the more times differentiable a function, the faster the coefficients decay, and hence the faster the Fourier series converges. In fact, if a function is smooth and 2π-periodic its Fourier coefficients decay faster than algebraically: they decay like O(k^{−λ}) for any λ. This will be explored in the problem sheet.

Remark: (advanced) Going further, if we let z=eiθ then if f(z) is analytic in a neighbourhood of the unit circle the Fourier coefficients decay exponentially fast. And if f(z) is entire they decay even faster than exponentially.

2. Trapezium rule and discrete Fourier coefficients

Let θ_j = 2πj/n for j = 0, 1, …, n denote n+1 evenly spaced points over [0, 2π]. The Trapezium rule over [0, 2π] is the approximation:

∫₀^{2π} f(θ) dθ ≈ (2π/n) [f(0)/2 + ∑_{j=1}^{n−1} f(θ_j) + f(2π)/2]

But if f is periodic, so that f(0) = f(2π), we get the periodic Trapezium rule:

∫₀^{2π} f(θ) dθ ≈ 2π (1/n) ∑_{j=0}^{n−1} f(θ_j) =: 2π Σₙ[f]

Define the Trapezium rule approximation to the Fourier coefficients by:

f̂_kⁿ := Σₙ[f(θ) e^{−ikθ}] = (1/n) ∑_{j=0}^{n−1} f(θ_j) e^{−ikθ_j}

Lemma (Discrete orthogonality) We have:

∑_{j=0}^{n−1} e^{ikθ_j} = n if k = …, −2n, −n, 0, n, 2n, … and 0 otherwise

In other words,

Σₙ[e^{i(k−j)θ}] = 1 if k − j = …, −2n, −n, 0, n, 2n, … and 0 otherwise.

Proof

Consider ω := e^{iθ_1} = e^{2πi/n}. This is an n-th root of unity: ωⁿ = 1. Note that e^{iθ_j} = e^{2πij/n} = ω^j.

(Case 1: k = pn for an integer p) We have

∑_{j=0}^{n−1} e^{ikθ_j} = ∑_{j=0}^{n−1} ω^{kj} = ∑_{j=0}^{n−1} (ωⁿ)^{pj} = ∑_{j=0}^{n−1} 1 = n

(Case 2: k ≠ pn for any integer p) Recall the geometric series

∑_{j=0}^{n−1} z^j = (zⁿ − 1)/(z − 1).

Then we have

∑_{j=0}^{n−1} e^{ikθ_j} = ∑_{j=0}^{n−1} (ω^k)^j = (ω^{kn} − 1)/(ω^k − 1) = 0

where we use the fact that k is not a multiple of n to guarantee that ω^k ≠ 1, while ω^{kn} = (ωⁿ)^k = 1. ∎

Theorem (discrete Fourier coefficients) If 𝐟̂ is absolutely convergent then

f̂_kⁿ = ⋯ + f̂_{k−2n} + f̂_{k−n} + f̂_k + f̂_{k+n} + f̂_{k+2n} + ⋯

Proof

f̂_kⁿ = Σₙ[f(θ) e^{−ikθ}] = ∑_{j=−∞}^∞ f̂_j Σₙ[e^{i(j−k)θ}] = ∑_{j ≡ k (mod n)} f̂_j ∎

Note that there is redundancy:

Corollary (aliasing) For all p ∈ ℤ, f̂_kⁿ = f̂_{k+pn}ⁿ.

In other words, if we know f̂_0ⁿ, …, f̂_{n−1}ⁿ then we know f̂_kⁿ for all k ∈ ℤ via a permutation. For example, if n = 2m+1 we have

[f̂_{−m}ⁿ, …, f̂_mⁿ]ᵀ = P_σ [f̂_0ⁿ, …, f̂_{n−1}ⁿ]ᵀ

where σ has Cauchy notation (careful: we are using 1-based indexing here):

(1 2 ⋯ m m+1 m+2 ⋯ n
 m+2 m+3 ⋯ n 1 2 ⋯ m+1)

 

We first discuss the case when all negative coefficients are zero, noting that the Fourier series is in fact a Taylor series if we let z = e^{iθ}:

f(z) = ∑_{k=0}^∞ f̂_k z^k.

That is, f̂_0ⁿ, …, f̂_{n−1}ⁿ are approximations of the Taylor series coefficients, obtained by evaluating on the boundary of the unit circle.

 

We can prove convergence of this approximation whenever f has absolutely summable coefficients. We will prove the result here in the special case where the negative coefficients are zero.

Theorem (Taylor series converges) If 0 = f̂_{−1} = f̂_{−2} = ⋯ and 𝐟̂ is absolutely convergent, then

fₙ(θ) = ∑_{k=0}^{n−1} f̂_kⁿ e^{ikθ}

converges uniformly to f(θ).

Proof

|f(θ) − fₙ(θ)| = |∑_{k=0}^{n−1} (f̂_k − f̂_kⁿ) e^{ikθ} + ∑_{k=n}^∞ f̂_k e^{ikθ}| = |∑_{k=n}^∞ f̂_k (e^{ikθ} − e^{i mod(k,n) θ})| ≤ 2 ∑_{k=n}^∞ |f̂_k|

which goes to zero as n → ∞. ∎

For the general case we need to choose a range of coefficients that includes roughly an equal number of negative and positive coefficients (preferring negative over positive in a tie, as a convention):

fₙ(θ) = ∑_{k=−⌊n/2⌋}^{⌈n/2⌉−1} f̂_kⁿ e^{ikθ}

In the problem sheet we will prove this converges provided the coefficients are absolutely convergent.

 

 

 

3. Discrete Fourier transform and interpolation

We note that the map from values to coefficients can be defined as a matrix-vector product using the DFT:

Definition (DFT) The Discrete Fourier Transform (DFT) is defined as:

Qₙ := (1/√n) [1 1 1 ⋯ 1; 1 ω⁻¹ ω⁻² ⋯ ω^{−(n−1)}; 1 ω⁻² ω⁻⁴ ⋯ ω^{−2(n−1)}; ⋮; 1 ω^{−(n−1)} ω^{−2(n−1)} ⋯ ω^{−(n−1)²}]

for the n-th root of unity ω = e^{2πi/n}, so that the (k+1, j+1) entry is (1/√n) e^{−ikθ_j}. Note that its adjoint is

Qₙ* = (1/√n) [1 1 1 ⋯ 1; 1 ω ω² ⋯ ω^{n−1}; 1 ω² ω⁴ ⋯ ω^{2(n−1)}; ⋮; 1 ω^{n−1} ω^{2(n−1)} ⋯ ω^{(n−1)²}]

That is, we have

[f̂_0ⁿ, …, f̂_{n−1}ⁿ]ᵀ = (1/√n) Qₙ [f(θ_0), …, f(θ_{n−1})]ᵀ

The choice of normalisation constant is motivated by the following:

Proposition (DFT is unitary) Qₙ is unitary: Qₙ*Qₙ = QₙQₙ* = I.

Proof

Qₙ*Qₙ = [Σₙ[1] Σₙ[e^{iθ}] ⋯ Σₙ[e^{i(n−1)θ}]; Σₙ[e^{−iθ}] Σₙ[1] ⋯ Σₙ[e^{i(n−2)θ}]; ⋮; Σₙ[e^{−i(n−1)θ}] Σₙ[e^{−i(n−2)θ}] ⋯ Σₙ[1]] = I

by discrete orthogonality. ∎

In other words, Qₙ is easily inverted and we also have a map from discrete Fourier coefficients back to values:

√n Qₙ* [f̂_0ⁿ, …, f̂_{n−1}ⁿ]ᵀ = 𝐟

Corollary (Interpolation) fₙ(θ) interpolates f at θ_j:

fₙ(θ_j) = f(θ_j)

Proof We have

fₙ(θ_j) = ∑_{k=0}^{n−1} f̂_kⁿ e^{ikθ_j} = √n 𝐞_jᵀ Qₙ* [f̂_0ⁿ, …, f̂_{n−1}ⁿ]ᵀ = 𝐞_jᵀ Qₙ* Qₙ 𝐟 = f(θ_j). ∎

We will leave extending this result to the problem sheet. Note that regardless of choice of coefficients we interpolate, though some interpolations are better than others:
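A numerical check of the interpolation property (a NumPy sketch; `numpy.fft.fft` computes the sums ∑_j f(θ_j)e^{−ikθ_j}, so dividing by n gives the discrete coefficients in the Taylor case):

```python
import numpy as np

# Interpolation at the grid: with discrete coefficients computed by the FFT,
# the Taylor-case approximation f_n matches f at every theta_j.
n = 16
theta = 2 * np.pi * np.arange(n) / n
f = np.exp(np.exp(1j * theta))        # f(theta) = exp(exp(i theta))

fhat = np.fft.fft(f) / n              # discrete coefficients \hat f_k^n
fn = np.array([sum(fhat[k] * np.exp(1j * k * t) for k in range(n))
               for t in theta])
err = np.max(np.abs(fn - f))
print(err)   # ~ machine precision
```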

We now demonstrate the relationship of Taylor and Fourier coefficients and their discrete approximations for some examples:

Example Consider the function

f(θ) = 2/(2 − e^{iθ})

Under the change of variables z = e^{iθ}, we know for z on the unit circle this becomes (using the geometric series with ratio z/2)

2/(2 − z) = ∑_{k=0}^∞ z^k/2^k

i.e., f̂_k = 1/2^k, which is absolutely summable:

∑_{k=0}^∞ |f̂_k| = f(0) = 2.

If we use an n-point discretisation we get (using the geometric series with ratio 2⁻ⁿ)

f̂_kⁿ = f̂_k + f̂_{k+n} + f̂_{k+2n} + ⋯ = ∑_{p=0}^∞ 1/2^{k+pn} = 2^{n−k}/(2ⁿ − 1)

We can verify this numerically:
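A NumPy sketch of this verification (using the `numpy.fft` conventions; not the notes' Julia code):

```python
import numpy as np

# Verify \hat f_k^n = 2^(n-k) / (2^n - 1) for f(theta) = 2/(2 - exp(i theta)).
n = 8
theta = 2 * np.pi * np.arange(n) / n
fhat = np.fft.fft(2 / (2 - np.exp(1j * theta))) / n

k = np.arange(n)
exact = 2.0**(n - k) / (2.0**n - 1)
err = np.max(np.abs(fhat - exact))
print(err)
```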

4. Fast Fourier Transform

Applying Qₙ or its adjoint Qₙ* to a vector naively takes O(n²) operations. Both can be reduced to O(n log n) using the celebrated Fast Fourier Transform (FFT), which is one of the Top 10 Algorithms of the 20th Century (You won't believe number 7!).

The key observation is that hidden inside Q_{2n} are two copies of Qₙ. Since we work with multiple values of n, we denote the n-th root of unity as ωₙ := exp(2πi/n). Note that we can relate a vector of powers of ω_{2n} to two copies of vectors of powers of ωₙ:

[1, ω_{2n}, …, ω_{2n}^{2n−1}]ᵀ = P_σᵀ [Iₙ; ω_{2n}Iₙ] [1, ωₙ, …, ωₙ^{n−1}]ᵀ

where σ has the Cauchy notation

(123nn+12n1352n122n)

That is, Pσ is the following matrix which takes the even entries and places them in the first n entries and the odd entries in the last n entries:

and so P_σᵀ reverses the process. Thus we have

Q_{2n} = (1/√2) P_σᵀ [Qₙ Qₙ; QₙDₙ −QₙDₙ] = (1/√2) P_σᵀ [Qₙ 0; 0 Qₙ] [Iₙ Iₙ; Dₙ −Dₙ]

where Dₙ is the diagonal matrix of "twiddle" factors, the first n negative powers of ω_{2n}.

In other words, we reduced the DFT to two DFTs applied to vectors of half the dimension.

We can see this formula in code:
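The Julia snippet is omitted above; here is a minimal radix-2 FFT sketch in Python illustrating the same divide-and-conquer structure (using the unnormalised DFT convention of `numpy.fft` rather than the unitary Qₙ):

```python
import numpy as np

def fft_radix2(x):
    """Recursive Cooley-Tukey FFT for lengths that are powers of 2."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_radix2(x[0::2])          # DFT of even-indexed entries
    odd = fft_radix2(x[1::2])           # DFT of odd-indexed entries
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.arange(16.0)
err = np.max(np.abs(fft_radix2(x) - np.fft.fft(x)))
print(err)   # agrees with the library FFT
```

Each call reduces a length-n DFT to two length-n/2 DFTs plus O(n) work, giving the O(n log n) count derived below.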

Now assume n = 2^q so that log₂ n = q. To see that we get O(n log n) = O(nq) operations we need to count the operations. Assume that applying Qₙ takes at most 3nq additions and multiplications. In the formula above, forming the first n entries takes n additions; the last n take n multiplications and n additions. Thus applying Q_{2n} takes at most 6nq + 3n ≤ 6n(q+1) = 3(2n)log₂(2n) additions/multiplications, showing by induction that we have O(n log n) operations.

 

Remark: The FFTW.jl package wraps the FFTW (Fastest Fourier Transform in the West) library, which is a highly optimised implementation of the FFT that also works well even when n is not a power of 2. (As an aside, the creator of FFTW Steven Johnson is now a Julia contributor and user.) Here we approximate exp(cos(θ0.1)) using 31 nodes:

Thus we have successfully approximated the function to roughly machine precision. The magic of the FFT is that because it is O(n log n) we can scale it to very high orders. Here we plot the Fourier coefficients for a function that requires around 100k coefficients to resolve:

We can use the FFT to compute some mathematical objects efficiently. Here is a simple example.

Example Define the following infinite sum (which has no name apparently, according to Mathematica):

Sₙ(k) := ∑_{p=0}^∞ 1/(k+pn)!

We can use the FFT to compute Sn(0),,Sn(n1) in O(nlogn) operations. Consider

f(θ) = exp(e^{iθ}) = ∑_{k=0}^∞ e^{ikθ}/k!

where we know the Fourier coefficients f̂_k = 1/k! from the Taylor series of e^z. The discrete Fourier coefficients satisfy, for 0 ≤ k ≤ n−1:

f̂_kⁿ = f̂_k + f̂_{k+n} + f̂_{k+2n} + ⋯ = Sₙ(k)

Thus we have

[Sₙ(0), …, Sₙ(n−1)]ᵀ = (1/√n) Qₙ [exp(e^{2πi·0/n}), exp(e^{2πi/n}), …, exp(e^{2πi(n−1)/n})]ᵀ